COME is designed to calculate COding potential from Multiple fEatures for transcripts.
COME accepts a GTF-format file as input and predicts whether each input transcript is coding or non-coding. It integrates multiple sequence-derived and experiment-based features using a decompose-compose method, which makes COME’s performance more accurate and robust than other well-known tools for transcripts of different lengths and assembly qualities. First, COME composes the feature matrix for the given transcripts from the pre-calculated feature vectors. Second, COME predicts the coding potential with the pre-trained models, using the feature matrix generated in the first step. COME is currently pre-trained for five model species: human (hg19), mouse (mm10), fly (dm3), worm (ce10) and plant (TAIR10).
II. Input file
The input GTF file must meet the following requirements:
1) It must follow the GTF format as described by UCSC.
2) Chromosome names must use the abbreviated, lower-case "chr" prefix (e.g. chr1, chrX), except for the worm genome, which uses Roman numerals (e.g. chrI, chrII, chrIII, chrIV, chrX).
3) Only exon is allowed in the third column (feature type).
4) Only + or - is allowed in the seventh column (strand).
5) gene_id and transcript_id must be provided for every transcript in the GTF file.
6) Each transcript must be longer than 50 nucleotides.
7) Lines that do not meet these requirements will be skipped.
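
The requirements above can be checked before submitting a file. The following Python sketch is not part of COME; it is a hypothetical pre-filter that only mirrors the listed rules. The function name filter_gtf_lines is an assumption, the chromosome check is simplified to a "chr" prefix test, and transcript length is assumed to be the sum of the transcript's exon lengths (GTF coordinates are 1-based and inclusive).

```python
import re
from collections import defaultdict

MIN_LENGTH = 50  # rule 6: transcripts must be longer than 50 nucleotides


def filter_gtf_lines(lines):
    """Hypothetical pre-filter mirroring the input rules listed above.

    Returns (kept_lines, too_short_transcript_ids).
    """
    candidates = []            # (transcript_id, original line)
    lengths = defaultdict(int)  # transcript_id -> summed exon length

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:   # a GTF line has nine tab-separated columns
            continue
        chrom, _src, feature, start, end, _score, strand, _frame, attrs = fields

        # Rule 2 (simplified assumption): chromosome names start with "chr".
        if not chrom.startswith("chr"):
            continue
        # Rules 3 and 4: only exon features on the + or - strand.
        if feature != "exon" or strand not in ("+", "-"):
            continue
        # Rule 5: both gene_id and transcript_id must be present.
        gene = re.search(r'gene_id "([^"]+)"', attrs)
        tx = re.search(r'transcript_id "([^"]+)"', attrs)
        if not gene or not tx:
            continue
        try:
            exon_len = int(end) - int(start) + 1  # 1-based, inclusive
        except ValueError:
            continue

        tid = tx.group(1)
        lengths[tid] += exon_len
        candidates.append((tid, line))

    # Rule 6: drop transcripts whose summed exon length is 50 nt or less.
    kept = [line for tid, line in candidates if lengths[tid] > MIN_LENGTH]
    too_short = sorted(t for t, n in lengths.items() if n <= MIN_LENGTH)
    return kept, too_short
```

Calling filter_gtf_lines(open("input.gtf")) on a hypothetical input.gtf would return the lines that pass the checks together with the IDs of transcripts rejected for length, which can help explain why some transcripts are missing from the output.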
III. Calculation Time
Transcript length    Number of transcripts    Running time (seconds)
[0.2k, 0.5k)         1000                     63 ± 5
[0.5k, 1.0k)         1000                     88 ± 27
[1.0k, 1.5k)         1000                     81 ± 31
[1.5k, 2.0k)         1000                     68 ± 10
randomly sampled     10                       38 ± 13
randomly sampled     100                      72 ± 18
randomly sampled     1000                     76 ± 8