Bioinformatics Manual

From Lu Lab Wiki

Level 1 Databases and Basic Analysis

Data Type and Format

Basic and Useful Tools

bamtools, samtools, bedtools:

shared scripts:

Basic analysis and tools comparison

Reviews and comparison on transcriptome analysis and tools

Alamancos, G. P., et al. (2013). Methods to study splicing from high-throughput RNA Sequencing data. arXiv preprint arXiv:1304.5952.

Smith, D. R. (2013). RNA-Seq data: a goldmine for organelle research. Brief Funct Genomics.

Rung, J. and A. Brazma (2013). Reuse of public genome-wide gene expression data. Nat Rev Genet 14(2): 89-99.

Garber, M., et al. (2011). Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 8(6): 469-477.

Chen, G., et al. (2011). Overview of available methods for diverse RNA-Seq data analyses. Sci China Life Sci 54(12): 1121-1128.

Oshlack, A., et al. (2010). From RNA-seq reads to differential expression results. Genome Biol 11(12): 220.


Engström, P. G., et al. (2013). Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods 10(12): 1185-1191.

Lindner, R. and C. C. Friedel (2012). A Comprehensive Evaluation of Alignment Algorithms in the Context of RNA-Seq. PLoS One 7(12): e52403.

Grant, G. R., et al. (2011). Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM). Bioinformatics 27(18): 2518-2528.

Transcriptome reconstruction

Steijger, T., et al. (2013). Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 10(12): 1177-1184.

Clarke, K., et al. (2013). Comparative analysis of de novo transcriptome assembly. Sci China Life Sci 56(2): 156-162.

Ren, X., et al. (2012). Evaluating de Bruijn graph assemblers on 454 transcriptomic data. PLoS One 7(12): e51188.

Mundry, M., et al. (2012). Evaluating characteristics of de novo assembly software on 454 transcriptome data: a simulation approach. PLoS One 7(2): e31410.

Gene/Transcript Expression Calculation

For paired-end data, htseq-count counts read pairs (fragments), not individual reads.

 For paired-end data, does htseq-count count reads or read pairs? 
 The script is designed to count “units of evidence” for gene expression. 
 If both mates map to the same gene, this still only shows that one cDNA fragment originated from that gene. Hence, it should be counted only once.
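
The "units of evidence" idea can be illustrated with a minimal sketch. This is not htseq-count's implementation: the function name, the input layout (one `(fragment_id, gene_id)` tuple per aligned mate) and the `ambiguous` bucket are assumptions made for illustration only.

```python
from collections import Counter


def count_fragments(alignments):
    """Count each fragment once, however many of its mates hit the gene.

    alignments: iterable of (fragment_id, gene_id) tuples, one per aligned
    mate (hypothetical input layout, not an htseq-count data structure).
    """
    genes_by_fragment = {}
    for fragment_id, gene_id in alignments:
        genes_by_fragment.setdefault(fragment_id, set()).add(gene_id)

    counts = Counter()
    for genes in genes_by_fragment.values():
        if len(genes) == 1:
            # Both mates (or the single mapped mate) support one gene:
            # one cDNA fragment, so one count.
            counts[next(iter(genes))] += 1
        else:
            # Mates disagree; kept in a separate illustrative bucket.
            counts["ambiguous"] += 1
    return counts


alignments = [
    ("frag1", "geneA"), ("frag1", "geneA"),  # both mates in geneA
    ("frag2", "geneA"),                      # only one mate mapped
    ("frag3", "geneB"), ("frag3", "geneB"),  # both mates in geneB
]
fragment_counts = count_fragments(alignments)
read_counts = Counter(gene for _, gene in alignments)
print(dict(fragment_counts))  # per-fragment counts
print(dict(read_counts))      # per-read counts, for comparison
```

Counting per read would give geneA three counts here, although only two cDNA fragments support it.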

Differential expression


Soneson, C. and M. Delorenzi (2013). A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 14(1): 91.

Rapaport, F., et al. (2013). Comprehensive evaluation of differential expression analysis methods for RNA-seq data. arXiv preprint arXiv:1301.5277.

Dillies, M. A., et al. (2012). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform.

Kvam, V. M., et al. (2012). A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot 99(2): 248-256.

Bullard, J. H., et al. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11: 94.

Alternative splicing

Alamancos, G. P., et al. (2013). Methods to study splicing from high-throughput RNA Sequencing data. arXiv preprint arXiv:1304.5952.

Public databases

Model species

UCSC Genome Browser : A database and browser of various genomes, including mouse and human

SGD : A database of yeast genes and genome

WormBase : A database of worm genes and genome

FlyBase : A database of fly genes and genome

MGI : A database of mouse genes and genome

OMIM : A catalog of human genes and genetic disorders

GeneCards : A database of human genes, their products and their involvement in diseases

Big projects

1000 Genomes Project

ENCODE Project

ENCODE Project at UCSC


Roadmap Epigenomics

The Cancer Genome Atlas

International Cancer Genome Consortium

Dataset resource


Stanford MicroArray Database



dbSNP : A database of SNPs for various organisms

cis-regulatory elements


Cistrome Project

RNA General

The RNA World Website : A collection of links to RNA-related information

RNA structure

PDB : The RCSB Protein Data Bank

NDB : The Nucleic Acid Database

RNA FRABASE 2.0 : RNA Fragments search engine & database

BPS : A database of RNA base-pair structures

RNAJunction : A database of RNA junction and kissing loop structures

NTDB : Thermodynamic data for nucleic acids

RNAmods : The RNA Modification Database

Modomics : Database of RNA Modifications

SCOR : Database of RNA motif structure, function, tertiary interactions and their relationships

RNA-protein/miRNA interactions

CLIPdb: An integrative resource of CLIP-seq studies (Lu Lab)

RBPDB : A database of RNA-binding specificities

CISBP-RNA : A database of RNA binding proteins and their motifs

PRIDB : A protein–RNA interface database

starBase : Protein-RNA and miRNA-target interaction maps

CLIPZ : A database and analysis environment for CLIP-seq data

MAASE : A database of alternative splicing

ASPicDB : A database of alternative splicing of human genes


Gene Ontology : A database of Gene Ontology (GO) terms and their annotations for various genes

KEGG : Databases of known pathways, genes, and reactions

Reactome : A pathway database

Large-scale networks

The BioGRID : A repository for various types of biological interaction datasets

GeneMANIA : Biological network integration

CellNet : Cell- and tissue-specific gene regulatory networks (GRNs)

GIANT : A web portal for tissue-specific functional networks

Alon Lab Collection of Complex Networks

3D Structure

0. Download PDB

1. 3D drawing tool (to display solvent accessibility, replace the B-factor column in the PDB file with the DSSP result)

1) VMD

2) PyMOL education version (free)


2. Solvent accessibility calculation tool:
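
The B-factor substitution mentioned in step 1 can be sketched as follows. This is a minimal example, assuming the per-residue accessibility values have already been parsed from the DSSP output into a dictionary keyed by (chain ID, residue number); the function name and that dictionary layout are hypothetical. It relies on the fixed-column PDB format, where the B-factor occupies columns 61-66, and it ignores insertion codes.

```python
def replace_b_factors(pdb_lines, accessibility, default=0.0):
    """Return PDB lines with the B-factor column replaced.

    accessibility: dict mapping (chain_id, residue_number) -> value
    (assumed to be prepared beforehand from the DSSP result).
    """
    updated_lines = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            chain_id = line[21]                # column 22
            residue_number = int(line[22:26])  # columns 23-26
            value = accessibility.get((chain_id, residue_number), default)
            # Columns 61-66 hold the B-factor, formatted as %6.2f.
            line = line[:60] + f"{value:6.2f}" + line[66:]
        updated_lines.append(line)
    return updated_lines


# Build one well-formed ATOM record field by field to keep columns exact.
example = (
    "ATOM  " + f"{1:5d}" + " " + " N  " + " " + "MET" + " " + "A" + f"{1:4d}"
    + " " + "   " + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
    + f"{1.00:6.2f}{9.67:6.2f}" + " " * 10 + " N"
)
updated = replace_b_factors([example], {("A", 1): 123.0})
print(updated[0])
```

The rewritten file can then be loaded in VMD or PyMOL and coloured by B-factor, which now carries the accessibility values.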


motif search tools

MEME:

RNAcontext - RNAmotif:



Level 2 Advanced Methods and Research Projects

Database, Server, Tool and Software

Basic NGS Analysis

Recommended tools

1. Basic aligner

  • Bowtie2: supports gapped, local, and paired-end alignment modes

2. Junction mapper:

  • TopHat: splice junction mapper for RNA-Seq reads using Bowtie.
  • TopHat2: successor to TopHat that finds both known and novel splice sites and uses Bowtie2 by default.

3. Assembler

  • Cufflinks: assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples.
  • SLIDE: sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation

4. Differential expression analysis

  • DEGseq: recommended when there are no biological replicates
  • GFOLD: recommended when there are no biological replicates
  • edgeR: recommended when there are biological replicates
  • DESeq: recommended when there are biological replicates
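
How the recommended tools chain together can be sketched as below. This is a minimal illustration, not a lab pipeline: the helper names, file names and thread count are hypothetical. The code only builds the command lines and does not run them; the `-p` (threads) and `-o` (output directory) options are taken from the TopHat and Cufflinks command-line interfaces, and `accepted_hits.bam` is the alignment file TopHat writes to its output directory.

```python
import os


def tophat2_command(bowtie2_index, reads_1, reads_2, out_dir, threads=4):
    """Command line for spliced alignment of paired-end reads."""
    return ["tophat2", "-p", str(threads), "-o", out_dir,
            bowtie2_index, reads_1, reads_2]


def cufflinks_command(bam_path, out_dir, threads=4):
    """Command line for transcript assembly from aligned reads."""
    return ["cufflinks", "-p", str(threads), "-o", out_dir, bam_path]


def suggest_de_tools(has_biological_replicates):
    """Tools from the list above, chosen by experimental design."""
    if has_biological_replicates:
        return ["edgeR", "DESeq"]
    return ["DEGseq", "GFOLD"]


# Hypothetical sample; all paths are placeholders.
align_dir = "sample1_tophat"
align_cmd = tophat2_command("genome_index", "s1_R1.fastq", "s1_R2.fastq", align_dir)
assemble_cmd = cufflinks_command(
    os.path.join(align_dir, "accepted_hits.bam"), "sample1_cufflinks"
)
print(" ".join(align_cmd))
print(" ".join(assemble_cmd))
print(suggest_de_tools(has_biological_replicates=False))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed and the placeholder paths are replaced.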

New Tools

  • HISAT (mapping)
  • StringTie (expression level)
  • Ballgown (differential expression)

Comparison of tools

  • Aligner: Systematic evaluation of spliced alignment programs for RNA-seq data
  • Assembler: Assessment of transcript reconstruction methods for RNA-seq

Recommended Pipelines

 Tutorial for RNA-seq analysis @ GitHub

RNA Structure and RBP Network

Data Mining

Recommended Reviews

Principles and methods of integrative genomic analyses in cancer. Nat Rev Cancer 14, 299–313 (2014) Kristensen, V., ..., Anne-Lise Børresen-Dale

Machine learning applications in genetics and genomics. Nature Reviews Genetics 16, 321–332 (2015) Maxwell W. Libbrecht & William Stafford Noble