The words you are searching are inside this book. To get more targeted content, please make full-text search by clicking here.

Home Explore De novo transcriptome assembly using Trinity

View in Fullscreen

De novo transcriptome assembly using Trinity Robert Bukowski, Qi Sun Bioinformatics Facility Institute of Biotechnology Slides: http://cbsu.tc.cornell.edu/lab/doc ...

Like this book? You can publish your book online for free in a few minutes!

http://anyflip.com/qkfb/lpui/

Download PDF

Related Publications

Discover the best professional documents and content resources in AnyFlip Document Base.

Published by , 2017-01-22 05:15:03

De novo transcriptome assembly using Trinity

Pages:

1 - 50
51 - 84

De novo transcriptome assembly using Trinity Robert Bukowski, Qi Sun Bioinformatics Facility Institute of Biotechnology Slides: http://cbsu.tc.cornell.edu/lab/doc ...

De novo tra
assembly u

Robert Buko
Bioinforma
Institute of Bi

Slides: http://cbsu.tc.cornell.edu/lab/doc
Exercise instructions: http://cbsu.tc.cornell.edu/lab/doc

anscriptome
using Trinity

owski, Qi Sun
atics Facility
iotechnology

c/Trinity_workshop_Part1.pdf
c/Trinity_exercise1_2016.pdf

Strategies for transcriptome ass

Reference-based RNA-Seq read
Read cleanu

Splice-aware alignment to
reference genome
TopHat, STAR

Transcript reconstruction
Cufflinks, Scripture

Post-assembly analysis
Assembly QC ass
Full-length transc
Abundance estim
Differential Expre
Protein coding re
Functional annot

sembly from RNA-Seq data

ds
De novo

up

De novo assembly of transcriptome
Oases
SOAPdenovo-trans
TransABYSS
Trinity

sessment
cript analysis
mation
ession analysis
egions identification
tation

Trinity: state of the art de novo RNA

Assembler

(today’s topic)
Developed at Broad Institute and Hebrew University of Jer
N. G Grabherr et. al., Nature Biotechnology 29, 644–652 (20
B. J. Haas et. al., Nature Protocols 8, 1494–1512 (2013) doi:

A-Seq assembly and analysis package

Variety of post-assembly tools

Assembly QC*
Abundance estimation*
Differential Expression analysis*
Protein coding regions identification
Functional annotation*
Full-length transcript analysis*

(*next session)

rusalem
011) doi:10.1038/nbt.1883
:10.1038/nprot.2013.084

As other short read assemblers, Trin

What are K-mers?

From: http://www.homolog.us/Tutorials/index.php

nity works in K-mer space (K=25)

What an assembler

K-mer extraction Path validation u
from reads (and reads
counting)

From: http://www.homolog.us/Tutorials/index.php

is trying to do

using

Complications:
• Sequencing errors add nodes (and

complexity)
• Short or long K-mers?
• No positional info on repeats (less

important for transcriptomes)

Three stages of Trinity

0. Jellyfish
• Extracts and counts K-mers (K=25) from reads

1. Inchworm:
• Assembles initial contigs by “greedily” extending
sequences with most abundant K-mers

2. Chrysalis:
• Clusters overlapping Inchworm contigs, builds de
Bruijn graphs for each cluster, partitions reads
between clusters

3. Butterfly:
• resolves alternatively spliced and paralogous
transcripts independently for each cluster (in
parallel)

From: N. G Grabherr et. al., Nature Biotechnology

y 29, 644–652 (2011) doi:10.1038/nbt.1883

Trinity p

Trinity proper
• Trinity (perl script to glue it all together)
• Inchworm
• Chrysalis
• Butterfly (Java code – needs Java 1.7)
• various utility and analysis scripts (in perl)

Bundled third-party software
• Trimmomatic: clean up reads by trimming and removing
• Jellyfish: k-mer counting software
• Fastool: fasta and fastq format reading and conversion (F
• ParaFly: parallel driver (Broad Institute)
• Slclust: a utility that performs single-linkage clustering wi
weakly bound clusters into distinct clusters (Brian Haas)
• Collectl : system performance monitoring (Peter Seger)
• Post-assembly analysis helper scripts (in perl)

External software Trinity depends on (needs to be in the search PA
• samtools
• Bowtie
• RSEM, eXpress: alignment-based abundance estimation (
• kallisto, salmon: alignment-free abundance estimation
• Transcoder: identify candidate coding regions in within tr

programs

adapter remnants (Bolger, A. M., Lohse, M., & Usadel, B)
Francesco Strozzi)
ith the option of applying a Jaccard similarity coefficient to break

ATH):
(Bo Li and Colin Dewey)
ranscripts (Brian Haas - Broad, Alexie Papanicolaou – CSIRO)

Notation convention used

Trinity commands will be abbreviated using the variable TRI
TRINITY_DIR=/programs/trinityrnaseq-2.2.0
(this is the latest version installed on BioHPC Lab)

Thus, a command
$TRINITY_DIR/Trinity [options]
really means
/programs/trinityrnaseq-2.2.0/Trinity [op

The same notation is used in auxiliary shell scripts containing

d on the following slides

INITY_DIR to denote the location of the Trinity package
0

ptions]
g Trinity commands, used in the Exercise

Pipeline complicat

(in principle, just

Assess read quality, e.g., with Input file(s) with RNA
fastqc reads_L.fa, rea
FASTA or FASTQ forma

Trinity command, e.g.,

$TRINITY_DIR/Trinity –seqType fa –left reads_L

May include read clean-up and/or normalization

Program monitori

Trinity.f

Final output: FASTA file with

ted, but easy to run

a single command)

A-Seq reads, e.g.,
ads_R.fa
at

L.fa –right reads_R.fa –max_memory 10G –CPU 8

ing, log files

fasta

h assembled transcripts

Input: RNA-S

FASTA format: 2 lines for each read (“>name”, sequence)
>61DFRAAXX100204:1:100:10494:3070
ACTGCATCCTGGAAAGAATCAATGGTGGCCGGAAAGTGTTTT

FASTQ format: 4 lines per read (“@name”, sequence, “+”, qua
@61DFRAAXX100204:1:100:10494:3070
ACTGCATCCTGGAAAGAATCAATGGTGGCCGGAAAGTGTTTT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC

ASCII code of a letter in quality string - 33 equals phred qualit

For example, “B” stands for: 66 – 33 = 33, i.e., probability of t

Base qualities are ignored by the assembler.

Seq reads

TTCAAATACAAGAGTGACAATGTGCCCTGTTGTTT

ality string)
TTCAAATACAAGAGTGACAATGTGCCCTGTTGTTT
C@@CACCCCCACCCCCCCCCCCCCCCCCCCCCCCC
ty score of the corresponding base.
the base (here: G) being miscalled is 10-3.3.

Input: RNA-S

Paired-end case: we have two “parallel” files – one for “left”

First sequence in “left” file

@61DFRAAXX100204:1:100:10494:3070/1
ACTGCATCCTGGAAAGAATCAATGGTGGCCGGAAAGTGTTTTTCAAA
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CA

First sequence in “right” file

@61DFRAAXX100204:1:100:10494:3070/2
CTCAAATGGTTAATTCTCAGGCTGCAAATATTCGTTCAGGATGGAAG
+
C < CCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Note1 : Read naming convention: /1 and /2 in front of the same name
Note2 : Cassava 1.8 names will be handled properly (no re-formatting

@HWI-ST896:156:D0JFYACXX:5:1101:1652:2132 1:N:0

will be treated as

HWI-ST896:156:D0JFYACXX:5:1101:1652:2132/1

Seq reads

” and another for “right” end of the fragment:

ATACAAGAGTGACAATGTGCCCTGTTGTTT
ACCCCCACCCCCCCCCCCCCCCCCCCCCCCC

GAACATTTTCTCAGTATTCCATCTAGCTGC
CCCBCCCCCCCCCCCCCCCCACCCCCACCC =
e core indicate read pairing
g needed), e.g.,
0:GATCAG

Trinity com

Suppose we have paired-end reads from one individual and 4 c

Sp_ds.left.fq.gz, Sp_hs.left.fq.gz, Sp_log.lef
Sp_ds.right.fq.gz, Sp_hs.right.fq.gz, Sp_log.rig

(In this example, the FASTQ files have been compressed with gzip; unco

A Trinity command may be long and tedious to type. It is conven
bash script, like this (note: “\” characters break one long line in

#!/bin/bash

$TRINITY_DIR/Trinity --seqType fq \
--left Sp_ds.left.fq.gz,Sp_hs.left.fq.gz,Sp
--right Sp_ds.right.fq.gz,Sp_hs.right.fq.gz
--SS_lib_type RF \
--max_memory 2G \
--CPU 2 \
--output /workdir/bukowski/trinity_out
Save the script (e.g., as my_trinity_script.sh) and run

nohup ./my_trinity_script.sh

mmand line

conditions in FASTQ files

ft.fq.gz, Sp_plat.left.fq.gz,
ght.fq.gz, Sp_plat.right.fq.gz
ompressed files can also be used.)

nient to create (using a text editor, like nano or vi) a
nto shorter pieces):

p_log.left.fq.gz,Sp_plat.left.fq.gz \
z,Sp_log.right.fq.gz,Sp_plat.right.fq.gz \

File names separated by “,” but NO SPACES!!!

it, redirecting any screen output to a file on disk:
>& my_trinity_script.log &

Basic Trinity comm

#!/bin/bash

$TRINITY_DIR/Trinity --seqType fq \
--left Sp_ds.left.fq.gz,Sp_hs.left.fq.gz,S
--right Sp_ds.right.fq.gz,Sp_hs.right.fq.g
--SS_lib_type RF \
--max_memory 2G \
--CPU 2 \
--output /workdir/bukowski/trinity_out

NOTES:
• Use all reads from an individual (all conditions) to capture mo
• Read files may be gzipped (as in this example) or not (then th
• Paired-end reads specified with --left and --right. If

• 2G is the maximum memory to be used at any stage which al
• At most 2 CPU cores will be used in any stage
• Final output and intermediate files will be written in director
• --SS_lib_type RF: The PE fragments are strand-specific

Forward strand of the sequenced mRNA template
• For non-strand specific reads, just skip the option --SS_

mand dissected…

Sp_log.left.fq.gz,Sp_plat.left.fq.gz \
gz,Sp_log.right.fq.gz,Sp_plat.right.fq.gz \

ost genes
hey should not have the “.gz” ending)
f only single-end, use --single instead.
llows memory limitation (jellyfish, sorting, etc.)
ry /workdir/bukowski/my_trinity_out
c, with left end on the Reverse strand and the right end on
_lib_type

Strand-spe

sense

--SS_lib_type
SE cases
PE cases

Strand specificity slightly increases computation time (more K-m
• Helps disambiguate between overlapping genes on opposite
Most RNA-Seq protocols now in use are strand specific

ecific RNA-Seq

anti-sense

e.g., dUTP/UDG protocol
mers), but is very helpful for de novo assembly
e strands of DNA, sense and nonsense transcripts

Basic Trinity command

#!/bin/bash

$TRINITY_DIR/Trinity --seqType fq \
--single Sp_ds.fq.gz,Sp_hs.fq.gz,Sp_log.fq
--SS_lib_type R \
--max_memory 2G \
--CPU 2 \
--output /workdir/bukowski/my_trinity_out

Basic Trinity command with

#!/bin/bash

$TRINITY_DIR/Trinity --seqType fq \
--left Sp_ds.left.fq.gz,Sp_ds.leftUnpaired
--right Sp_ds.right.fq.gz,Sp_ds.rightUnpai
--SS_lib_type RF \
--max_memory 2G \
--CPU 2 \
--output /workdir/bukowski/my_trinity_out

d with single-end reads

q.gz,Sp_plat.fq.gz \

h mixture of PE and SE reads

d.fq.gz \
ired.fq.gz \

Final output: assem

FASTA file called Trinity.fasta (located in the run outpu

FASTA headers contain gene/isoform ID for each transcript, as d
>TRINITY_DN0_c0_g2_i1 len=995 path=[1183:0-52 1184:53-994
>TRINITY_DN30_c0_g1_i1 len=2677 path=[2675:0-1921 2676:19
>TRINITY_DN30_c0_g2_i1 len=1940 path=[2673:0-1921 2674:19
>TRINITY_DN2_c0_g1_i1 len=386 path=[402:0-254 403:255-385
>TRINITY_DN2_c0_g2_i1 len=279 path=[404:0-254 405:255-278
>TRINITY_DN3_c0_g1_i1 len=1286 path=[2096:0-1271 2097:127
>TRINITY_DN3_c0_g1_i2 len=2102 path=[2094:0-1271 2095:127

TRINITY_DN30_c0_g2 is a gene identifier
i1 is an isoform identifier

The rest of the header shows nodes of de Buijn graph travers

mbled transcriptome

ut directory)
determined by Trinity
4] [-1, 1183, 1184, -2]
922-2676] [-1, 2675, 2676, -2]
922-1939] [-1, 2673, 2674, -2]
5] [-1, 402, 403, -2]
8] [-1, 404, 405, -2]
72-1285] [-1, 2096, 2097, -2]
72-2101] [-1, 2094, 2095, -2]

sed by the transcript.

Advanced (but importan
conside

nt) Trinity options to
er

General issues with de novo transcr

RNA-Seq reads often need pre-processing before assem
• remove barcodes and Illumina adapters from r
• clip read ends of low base quality
• remove contamination with other species

Very non-uniform coverage
• highly vs lowly expressed genes
• high-coverage in some regions may imply more
road
• some “normalization” of read set needed

Gene-dense genomes pose a challenge (assembly may
• overlapping genes on opposite strands (strand
• some overlap of genes on the same strand (ha

riptome assembly of RNA-Seq data

mbly
reads

e sequencing errors complicating assembly down the

y produce chimeric transcripts)
d-specific RNA-Seq protocols may help)
arder to handle)

Pre-assembly read

Trimmomatic: A flexible read trimming tool for Illumina NGS d
http://www.usadellab.org/cms/?page=trimmomatic)

java -jar /programs/Trimmomatic-0.36/trimm
left.fq.gz right.fq.gz \
left.fq.gz.P.qtrim left.fq.gz.U.qtrim \
right.fq.gz.P.qtrim right.fq.gz.U.qtrim \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDING

Filtering operations (in order specified) performed on each re

• Remove Illumina adapters (those in file TruSeq3-PE.f
• Clip read when average base quality over a 4bp sliding win
• Clip leading and trailing bases if base quality below 5
• Skip read if shorter than 25bp

clean-up option

data (Bolger et al.,

momatic-0.36.jar PE -threads 2 -phred33 \

GWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25

ead:
fa) using “palindrome” algorithm
ndow drops below 5

Pre-assembly read clea

Output files (gzipped if requested file name ends with “.gz”):
• left.fq.gz.P.qtrim, right.fq.gz.P.qtrim:

fragment survived filtering
• left.fq.gz.U.qtrim, right.fq.gz.U.qtrim:

the mate did not

Trimmomatic is bun

To invoke from within Trinity, add option

--trimmomatic

Also, one can request specific Trimmomatic settings by adding

--quality_trimming_params “ILLUMINACLIP:Tru
LEADING:20 TRAILING:10 MINLEN:30”

Trinity will use all surviving reads, treating the un-paired ones

an-up: Trimmomatic

“parallel” files of reads such that both ends of a
“left” and “right” reads that survived filtering, although

ndled with Trinity

g something similar to
uSeq3-PE.fa:2:30:10 SLIDINGWINDOW:6:5
s as single-end.

How to choose trimming pa
assessment w

Many tests are
expected to fail
due to
RNA_Seq
characteristics

Run fastqc before running T

arameters: Read quality
with fastqc

Trinity (part of the Exercise)

Dealing with very dee

• Done to identify genes with low expression
• Several hundreds of millions of reads involved
• More sequencing errors possible with large depths, increa

Suggested Trinity options/treatments:
• Use option

--min_kmer_cov 2
(singleton K-mers will not be included in initial Inchworm
• Perform K-mer based insilico read set normalization (--

• May end up using just 20% of all reads reducing com
quality

ep sequencing data

asing the graph complexity

m contigs)
-normalize_reads)
mputational burden with no impact on assembly

K-mer based rea

Sample the reads (fragments) with probability

= min(1, )
Where is the target K-mer coverage (k=25) and is the me
both fragment ends). Typically, = 30 − 50.

(Also: filter out reads for which STDEV of K-mer coverage exc

Effect: poorly covered regions unchanged, but reads down-sa

To invoke as a part of Trinity pipeline, add option

--normalize_reads –-normalize_max_read_cov

on Trinity command line (adjust target “30” to your needs)

To run read normalization separately, use the utility script:
$TRINITY_DIR/util/insilico

(run without arguments to see available options)
This normalization method has “

ad normalization

edian K-mer abundance along the read (or average over
ceeds )
ampled in high-coverage regions.
v 30

o_read_normalization.pl
“mixed reviews”

Avoiding chimeric transcripts

• Use strand-specific RNA-Seq protocol
• Try option –jaccard_clip (time-consuming; no effec

with IGV after DE calculation for important genes)

From: B. J. Haas et. al., Nature Protocols 8, 1494–1512 (2013) doi:10.1038/nprot.2013.084

s from gene-dense genomes

ct in case of gene-sparse genomes, maybe just check

Resource requ

How many reads are needed?
How long will the assembly take?
How much RAM is needed?

uirements

?

Pages:

1 - 50
51 - 84

Click to View FlipBook Version