The Cold Spring Harbor Lab (CSHL) small RNA track depicts short total RNA sequencing data from ENCODE tissues or sub-cellular compartments of ENCODE cell lines. The protocol used to generate these data produced directional reads from the 5' end of short RNAs (RNAs shorter than 200 nucleotides). Libraries were sequenced using an Illumina GAIIx. These data were generated by Cold Spring Harbor Laboratory as part of the ENCODE Consortium. The ENCODE project seeks to identify and characterize all functional elements in the human genome. In many cases, datasets from Cap Analysis of Gene Expression (CAGE, see the RIKEN CAGE Loc track), Long RNA-seq (RNAs longer than 200 nucleotides, see the CSHL Long RNA-seq track) and Pair-End di-TAG-RNA (PET-RNA, see the GIS RNA PET track) are available from the same biological replicates.
Display Conventions and Configuration
This track is a multi-view composite track that contains multiple data types
(views). For each view, there are multiple subtracks that
display individually on the browser. Instructions for configuring multi-view
tracks are here.
To show only selected subtracks, uncheck the boxes next to the tracks that
you wish to hide. Color differences among the views provide a visual cue for
distinguishing between the different cell types and compartments.
This track contains the following views:
- Contigs: BED format files representing blocks of overlapping mapped reads from pooled biological replicates. The corresponding number of mapped reads, the RPKM (Reads Per kb per Million reads) value, and the non-parametric Irreproducible Discovery Rate (np-IDR) are reported for each contig.
- Plus and Minus Signal: The Signal view shows the density of mapped reads on the plus and minus strands (wiggle format).
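As a rough illustration of the RPKM value reported for each contig, the sketch below computes reads per kilobase per million mapped reads. The function name, argument names, and example numbers are hypothetical and are not taken from the CSHL pipeline; in particular, which read total the pipeline normalizes by is an assumption here.

```python
def rpkm(read_count: int, contig_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per kb per Million mapped reads for a single contig.

    Assumption: total_mapped_reads is the number of mapped reads in the
    pooled library; the actual pipeline may normalize differently.
    """
    if contig_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("contig length and total read count must be positive")
    # reads / (length in kb) / (library size in millions)
    return read_count * 1_000_000_000 / (contig_length_bp * total_mapped_reads)


# Hypothetical example: 50 reads on a 100 bp contig in a 2-million-read library.
example_value = rpkm(50, 100, 2_000_000)
```

Because the value is normalized by both contig length and library size, it can be compared across contigs and across experiments of different sequencing depth.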
Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.
More views may be found on the Downloads Page.
Cells were grown according to the approved ENCODE cell culture protocols. Short RNAs between 20 and 200 nucleotides were isolated from total RNA using a Qiagen RNeasy kit (Qiagen #74204) according to the manufacturer's protocol. Purified small RNAs were depleted of ribosomal RNA. To clone different populations, the RNA was either left untreated (5' monophosphate RNAs), treated with Tobacco Alkaline Pyrophosphatase (both 5' monophosphate and capped RNAs), or treated with Calf Intestinal Alkaline Phosphatase followed by Tobacco Alkaline Pyrophosphatase (capped RNAs) prior to ligation of a 5' linker.
The 3' ends were polyadenylated in vitro (or polycytidylated in the case of Generation 0 data) using Poly-A Polymerase.
Anchored oligo-dT was used to prime the reverse transcriptase reaction, and sequencing-compatible ends were added in a subsequent PCR step. The libraries were sequenced on the Illumina GAIIx or HiSeq from the 5' ends for a total of either 36 or 101 cycles.
Complete protocols are available on the Downloads Page.
Data Processing and Analysis
Data from the Gingeras and Guigo labs were preprocessed to remove experimentally derived poly-A tails and Illumina 3' linkers from raw reads. The best alignment to the Illumina 3' linker was determined for each read. If the number of mismatches in the alignment was less than 20% of the aligned length, the read was clipped from the first aligned base. Pre-processed reads were mapped using the STAR algorithm. For a description of STAR, the source code and the mapping parameters used, see the STAR project website. Reads mapping 10 or fewer times are reported in the Signal and Alignment files.
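The linker-clipping rule can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes an ungapped alignment of the linker's 5' end against each suffix of the read, and it assumes that the "best" alignment is the one maximizing matches minus mismatches. The linker sequence in the example is a made-up placeholder.

```python
def clip_linker(read: str, linker: str, max_mismatch_frac: float = 0.2) -> str:
    """Clip a 3' linker from a read (hypothetical helper).

    Assumptions: ungapped alignment; best alignment = highest
    (matches - mismatches); ties go to the leftmost start position.
    """
    best_start = None
    best_score = 0
    best_mismatches = 0
    best_length = 0
    for start in range(len(read)):
        # Align the start of the linker against the read beginning at `start`.
        overlap = min(len(linker), len(read) - start)
        mismatches = sum(
            1 for a, b in zip(read[start:start + overlap], linker) if a != b
        )
        score = overlap - 2 * mismatches  # matches minus mismatches
        if score > best_score:
            best_start, best_score = start, score
            best_mismatches, best_length = mismatches, overlap
    # Clip only if mismatches are below 20% of the aligned length.
    if best_start is not None and best_mismatches < max_mismatch_frac * best_length:
        return read[:best_start]
    return read


PLACEHOLDER_LINKER = "GGATCCTTAA"  # made-up sequence, not the Illumina linker
clipped = clip_linker("ACACACACACGGATCC", PLACEHOLDER_LINKER)
```

In this example the read ends with the first six bases of the placeholder linker, so the read is clipped from the first aligned base.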
Mapped reads were discarded if they fell into any of the following categories: 1) the read contained five or more consecutive A's, 2) it was shorter than 16 nucleotides, 3) it mapped to more than one genomic position (multiply-mapped reads), or 4) it mapped upstream of genomically encoded poly-A sequences. The remaining reads were used both to call contigs and to produce expression values over GENCODE V7 exons. Contigs were generated from overlapping reads in pooled biological replicates.
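The filtering and contig-calling steps might look roughly like the sketch below. All names (MappedRead, keep_read, call_contigs) are hypothetical. The fourth filter (reads mapping upstream of genomically encoded poly-A) is omitted because it requires the genome sequence. The sketch also assumes that contigs are built per chromosome and strand from reads whose coordinates overlap.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class MappedRead(NamedTuple):
    chrom: str
    start: int      # 0-based, inclusive
    end: int        # 0-based, exclusive
    strand: str     # "+" or "-"
    sequence: str
    num_hits: int   # number of genomic positions the read mapped to


POLY_A = re.compile(r"A{5,}")  # five or more consecutive A's


def keep_read(read: MappedRead) -> bool:
    """Apply filters 1-3 from the text; filter 4 is not implemented here."""
    if POLY_A.search(read.sequence):
        return False
    if len(read.sequence) < 16:
        return False
    if read.num_hits > 1:
        return False
    return True


def call_contigs(reads: Iterable[MappedRead]) -> List[Tuple[str, int, int, str, int]]:
    """Merge overlapping reads into (chrom, start, end, strand, read_count)."""
    contigs: List[list] = []
    kept = sorted(filter(keep_read, reads), key=lambda r: (r.chrom, r.strand, r.start))
    for r in kept:
        last = contigs[-1] if contigs else None
        if last and last[0] == r.chrom and last[3] == r.strand and r.start < last[2]:
            last[2] = max(last[2], r.end)  # extend the current contig
            last[4] += 1
        else:
            contigs.append([r.chrom, r.start, r.end, r.strand, 1])
    return [tuple(c) for c in contigs]


SEQ = "CGTCGTCGTCGTCGTCGTCG"  # 20 nt placeholder sequence without A's
example_reads = [
    MappedRead("chr1", 100, 120, "+", SEQ, 1),
    MappedRead("chr1", 110, 130, "+", SEQ, 1),
    MappedRead("chr1", 200, 220, "+", SEQ, 1),
    MappedRead("chr1", 105, 125, "-", SEQ, 1),
    MappedRead("chr1", 100, 120, "+", "CGTAAAAACGTCGTCGTCGT", 1),  # poly-A run
    MappedRead("chr1", 100, 110, "+", "CGTCGTCGTC", 1),            # too short
    MappedRead("chr1", 100, 120, "+", SEQ, 3),                     # multi-mapped
]
example_contigs = call_contigs(example_reads)
```

The read count stored with each contig is the kind of quantity from which a per-contig RPKM value could then be derived.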
Generation 0 data:
Reads were trimmed to discard any bases following a quality score less than or equal to 20 and converted into FASTA format, thereby discarding quality information for the rest of the pipeline. As a result, the sequence quality scores in the BAM output are all displayed as "40" to indicate no quality information. The read lengths may exceed the insert sizes and consequently introduce 3' adapter sequence into the 3' end of the reads. The 3' sequencing adapter was removed from the reads using a custom clipper program (available at http://hannonlab.cshl.edu/fastx_toolkit/), which aligned the adapter sequence to the short reads allowing up to two mismatches and no indels (insertions or deletions). Regions that aligned were clipped off from the read. Terminal C nucleotides introduced at the 3' end of the RNA via the cloning procedure were also trimmed. Reads were aligned to the human genome (version hg19, using the gender build appropriate to the sample in question - female/male) using Bowtie (Langmead B et al., 2009). Reads that mapped 20 or fewer times with two or fewer mismatches were reported. See Release Notes for more information on Generation 0 datasets.
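The Generation 0 read-trimming steps can be approximated as below. This is a sketch under several assumptions and is not the clipper program itself: the quality trim is assumed to cut at the first base with a score of 20 or less (removing that base too), the adapter match is assumed to be the leftmost ungapped match with at most two mismatches, and the min_overlap parameter and adapter sequence are hypothetical additions for illustration.

```python
from typing import Sequence


def quality_trim(seq: str, quals: Sequence[int], threshold: int = 20) -> str:
    """Truncate the read at the first base whose quality is <= threshold."""
    for i, q in enumerate(quals):
        if q <= threshold:
            return seq[:i]
    return seq


def clip_adapter(seq: str, adapter: str, max_mismatches: int = 2,
                 min_overlap: int = 5) -> str:
    """Clip from the leftmost ungapped adapter match (no indels).

    min_overlap is a hypothetical guard against clipping on very short,
    uninformative overlaps; the source does not describe such a setting.
    """
    for start in range(len(seq) - min_overlap + 1):
        overlap = min(len(adapter), len(seq) - start)
        mismatches = sum(a != b for a, b in zip(seq[start:start + overlap], adapter))
        if mismatches <= max_mismatches:
            return seq[:start]
    return seq


def trim_terminal_c(seq: str) -> str:
    """Remove C nucleotides added to the 3' end during cloning."""
    return seq.rstrip("C")


def process_read(seq: str, quals: Sequence[int], adapter: str) -> str:
    """Quality trim, then adapter clip, then terminal-C trim."""
    trimmed = quality_trim(seq, quals)
    return trim_terminal_c(clip_adapter(trimmed, adapter))


PLACEHOLDER_ADAPTER = "GGATCCTTAA"  # made-up sequence for illustration only
raw = "TGTGTGTGTG" + "CCC" + "GGATCCT"  # insert + added C's + partial adapter
processed = process_read(raw, [40] * len(raw), PLACEHOLDER_ADAPTER)
```

Note that trimming every terminal C also removes any genuine C residues at the 3' end of the RNA, which is one reason 3' end positions from this kind of procedure are less reliable than 5' ends.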
The mapped data were visually inspected to verify that the majority of the reads mapped to the 5' ends of annotated small RNA classes.
This is Release 3 (July 2012) of CSHL Small RNA-seq with new data from the Gingeras lab. It includes twenty-two new cell lines:
CD20+, CD34+_Mobilized, HAoAF, HAoEC, HCH, HFDPC, HMEpC, hMSC-AT, hMSC-UC, HOB, HPC-PL, HPIEpC, HSaVEC, HVMF, HWP, IMR90, Monocytes-CD14+, NHDF, NHEM.f_M2, NHEM_M2, SkMC, SK-N-SH.
There are 53 new experiments in total. Release 3 data includes two new variations in protocol (CIP-TAP and untreated) to create different RNA populations.
Many of the datasets produced by the Hannon lab (Generation 0 datasets) in Release 1 have been replaced by newly generated data from the Gingeras lab in Release 2. Of all Generation 0 datasets, only data from K562 and Prostate tissue are still displayed. All Generation 0 datasets are still available for download.
Discrepancies between hg18 and hg19 versions of Generation 0 CSHL small RNA data: The alignment pipeline for the CSHL small RNA data was updated upon the release of the human genome version hg19, resulting in a few noteworthy discrepancies with the hg18 dataset. First, mapping was conducted with the open-source Bowtie algorithm (http://bowtie-bio.sourceforge.net/index.shtml) rather than the custom NexAlign software. As each algorithm uses a different strategy to perform alignments, the mapping results may vary even in genomic regions that do not differ between builds. The read processing pipeline also varies slightly in that we no longer retain information regarding whether an adapter sequence was clipped off a read.
Hannon lab members: Katalin Fejes-Toth, Vihra Sotirova, Gordon Assaf, Jon Preall
Gingeras and Guigo laboratories: Carrie A. Davis, Lei-Hoon See, Wei Lin
- Jonathan Preall (Generation 0 Data from Hannon Lab)
- Carrie Davis (experimental)
- Alex Dobin (computational)
- Wei Lin (computational)
- Tom Gingeras (primary investigator)
Fejes-Toth K, Sotirova V, Sachidanandam R, Assaf G, Hannon GJ, Kapranov P, Foissac S, Willingham AT, Duttagupta R, Dumais E, Gingeras TR. Post-transcriptional processing generates a diversity of 5'-modified long and short RNAs. Nature. 2009;457(7232):1028-32.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.
Data Release Policy
Data users may freely use ENCODE data, but may not, without prior
consent, submit publications that use an unpublished ENCODE dataset until
nine months following the release of the dataset. This date is listed in
the Restricted Until column in the track configuration page and the
download page. The full data release policy
for ENCODE is available