Data generation pipeline
View On GitHub
The data generation pipeline uses JBrowse perl scripts to output efficient JSON representations of GFF and FASTA data. The JBrowse perl scripts will be automatically installed by the apollo deploy
command, but can also be installed using install_jbrowse.sh
.
Running these scripts outputs
- WebApollo/bin
- WebApollo/src/perl5
- WebApollo/extlib
Where the bin/ directory contains the normal jbrowse perl scripts such as prepare-refseqs.pl and flatfile-to-json.pl. Refer to the troubleshooting guide if these files are not outputted.
DNA track setup with prepare-refseqs.pl
The first step to setup the genome browser is to load the reference genome data. We'll use the prepare-refseqs.pl
script to output to the data directory that we configured in config.properties or config.xml.
$ bin/prepare-refseqs.pl --fasta pyu_data/scf1117875582023.fa --out $JBROWSE_DATA_DIR
WebApollo track setup with add-webapollo-plugin.pl
After initializing the data directory, add the WebApollo plugin tracks using the add-webapollo-plugin.pl
. It takes a 'trackList.json' as an argument.
$ client/apollo/bin/add-webapollo-plugin.pl -i $JBROWSE_DATA_DIR/trackList.json
GFF3 pre-processing with split_gff_by_source.pl
Generating data from GFF3 works best by having a separate GFF3 per source type. If your GFF3 has all source types in the same file, as with the Pythium ultimum sample, then use the tools/data/split_gff_by_source.pl
script. We'll output the split GFF3 to some temporary directory (e.g. split_gff
).
$ mkdir split_gff
$ tools/data/split_gff_by_source.pl -i pyu_data/scf1117875582023.gff -d split_gff
If we look at the contents of WEB_APOLLO_SAMPLE_DIR/split_gff
, we can see we have the following files:
$ ls split_gff
blastn.gff est2genome.gff protein2genome.gff repeatrunner.gff
blastx.gff maker.gff repeatmasker.gff snap_masked.gff
We will load each file and create the appropriate tracks in the following steps.
GFF3 track setup with gene/transcript/exon/CDS/polypeptide with flatfile-to-json.pl
We'll start off by loading maker.gff
from the Pythium ultimum data. We need to handle that file a bit differently than the rest of the files since the GFF represents the features as gene, transcript, exons, and CDSs.
$ bin/flatfile-to-json.pl --gff split_gff/maker.gff --type mRNA --trackLabel maker --out $JBROWSE_DATA_DIR
We can also add styling to the track by changing the subfeatureClasses and className to use custom WebApollo CSS classes:
$ bin/flatfile-to-json.pl --gff split_gff/maker.gff \
--arrowheadClass trellis-arrowhead \
--subfeatureClasses '{"wholeCDS": null, "CDS":"brightgreen-80pct", "UTR": "darkgreen-60pct", "exon":"container-100pct"}' \
--className container-16px --type mRNA --trackLabel maker --out $JBROWSE_DATA_DIR
See the Customizing features section for more information on CSS styles.
GFF3 with match/match_part features with flatfile-to-json.pl
If your track uses match and match_part types instead of gene->mRNA->exon, you can load the track using the --type match argument.
We'll start off with blastn
results as an example.
$bin/flatfile-to-json.pl --gff split_gff/blastn.gff \
--arrowheadClass webapollo-arrowhead \
--subfeatureClasses '{"match_part": "darkblue-80pct"}'
--type match
--className container-10px --trackLabel blastn --out $JBROWSE_DATA_DIR
Generate searchable name index
Once data tracks have been created, you can generate a searchable index of names using the generate-names.pl script:
$ bin/generate-names.pl --verbose --out $JBROWSE_DATA_DIR
This script creates an index of sequence names and feature names in order to enable auto-completion in the navigation text box. If you have some tracks that have millions of features, consider using "--completionLimit 0" to disable the autocompletion which will save time.
BAM data
BAM files are natively supported so the file can be read (in chunks) directly from the server with no preprocessing.
To use this, copy the BAM+BAM index to $JBROWSE_DATA_DIR, and then use the add-bam-track.pl to add the file to the tracklist.
$ mkdir $JBROWSE_DATA_DIR/bam
$ cp pyu_data/simulated-sorted.bam $JBROWSE_DATA_DIR/bam
$ cp pyu_data/simulated-sorted.bam.bai $JBROWSE_DATA_DIR/bam
$ bin/add-bam-track.pl --bam_url bam/simulated-sorted.bam \
--label simulated_bam --key "simulated BAM" -i $JBROWSE_DATA_DIR/trackList.json
Note: the bam_url
parameter is a relative URL to the $JBROWSE_DATA_DIR. It is not a filepath!
BigWig data
WebApollo also has native support for BigWig files (.bw), so no extra processing of these files is required either.
To use this, copy the BigWig data into the WebApollo data directory and then use the add-bw-track.pl.
$ mkdir $JBROWSE_DATA_DIR/bigwig
$ cp pyu_data/*.bw $JBROWSE_DATA_DIR/bigwig
Now we need to add the BigWig track.
$bin/add-bw-track.pl --bw_url bigwig/simulated-sorted.coverage.bw \ `
--label simulated_bw --key "simulated BigWig"`</span>
Note: the bw_url
paramter is a relative URL to the $JBROWSE_DATA_DIR. It is not a filepath!
Customizing different annotation types (advanced)
After running add-webapollo-plugin.pl
, the annotation track will be added to trackList.json
. To change how the different annotation types look in the "User-created annotation" track, you'll need to update the mapping of the annotation type to the appropriate CSS class. This data resides in trackList.json
after running add-webapollo-plugin.pl
. You'll need to modify the JSON entry whose label is Annotations
. Of particular interest is the alternateClasses
element. Let's look at that default element:
"alternateClasses": {
"pseudogene" : {
"className" : "light-purple-80pct",
"renderClassName" : "gray-center-30pct"
},
"tRNA" : {
"className" : "brightgreen-80pct",
"renderClassName" : "gray-center-30pct"
},
"snRNA" : {
"className" : "brightgreen-80pct",
"renderClassName" : "gray-center-30pct"
},
"snoRNA" : {
"className" : "brightgreen-80pct",
"renderClassName" : "gray-center-30pct"
},
"ncRNA" : {
"className" : "brightgreen-80pct",
"renderClassName" : "gray-center-30pct"
},
"miRNA" : {
"className" : "brightgreen-80pct",
"renderClassName" : "gray-center-30pct"
},
"rRNA" : {
"className" : "brightgreen-80pct",
"renderClassName" : "gray-center-30pct"
},
"repeat_region" : {
"className" : "magenta-80pct"
},
"transposable_element" : {
"className" : "blue-ibeam",
"renderClassName" : "blue-ibeam-render"
}
},
For each annotation type, you can override the default class mapping for both className
and renderClassName
to use another CSS class. Check out the Customizing features section for more information on customizing the CSS classes.
Customizing features
The visual appearance of biological features in WebApollo (and JBrowse) is handled by CSS stylesheets with HTMLFeatures tracks. Every feature and subfeature is given a default CSS "class" that matches a default CSS style in a CSS stylesheet. These styles are are defined in src/main/webapps/jbrowse/plugins/WebApollo/jbrowse/track_styles.css
and src/main/webapps/jbrowse/plugins/WebApollo/css/webapollo_track_styles.css
. Additional styles are also defined in these files, and can be used by explicitly specifying them in the --className, --subfeatureClasses, --renderClassname, or --arrowheadClass parameters to flatfile-to-json.pl. See example above
WebApollo differs from JBrowse in some of it's styling, largely in order to help with feature selection, edge-matching, and dragging. WebApollo by default uses invisible container elements (with style class names like "container-16px") for features that have children, so that the children are fully contained within the parent feature. This is paired with another styled element that gets rendered within the feature but underneath the subfeatures, and is specified by the --renderClassname argument to flatfile-to-json.pl. Exons are also by default treated as special invisible containers, which hold styled elements for UTRs and CDS.
It is relatively easy to add other stylesheets that have custom style classes that can be used as parameters to flatfile-to-json.pl. For example, you can create $JBROWSE_DATA_DIR/custom_track_styles.css
which contains two new styles:
.gold-90pct,
.plus-gold-90pct,
.minus-gold-90pct {
background-color: gold;
height: 90%;
top: 5%;
border: 1px solid gray;
}
.dimgold-60pct,
.plus-dimgold-60pct,
.minus-dimgold-60pct {
background-color: #B39700;
height: 60%;
top: 20%;
}
In this example, two subfeature styles are defined, and the top property is being set to (100%-height)/2 to assure that the subfeatures are centered vertically within their parent feature. When defining new styles for features, it is important to specify rules that apply to plus-stylename and minus-stylename in addition to stylename, as WebApollo adds the "plus-" or "minus-" to the class of the feature if the the feature has a strand orientation.
You need to tell WebApollo where to find these styles by modifying the JBrowse config or the plugin config, e.g. by adding this to the trackList.json
"css" : "sample_data/custom_track_styles.css"
Or you can also instead add the custom_track_styles.css to src/main/webapp/plugins/WebApollo/css/
and then use the @import command in src/main/webapp/jbrowse/plugins/WebApollo/css/main.css
. Then you may use these new styles when loading tracks to flatfile-to-json.pl, for example:
bin/flatfile-to-json.pl --gff WEB_APOLLO_SAMPLE_DIR/split_gff/maker.gff
--getSubfeatures --type mRNA --trackLabel maker --webApollo
--subfeatureClasses '{"CDS":"gold-90pct", "UTR": "dimgold-60pct"}'
Bulk loading annotations to the user annotation track
GFF3
You can use the tools/data/add_transcripts_from_gff3_to_annotations.pl
script to bulk load GFF3 files with transcripts to the user annotation track. Let's say we want to load our maker.gff
transcripts.
$ tools/data/add_transcripts_from_gff3_to_annotations.pl \
-U localhost:8080/WebApollo -u web_apollo_admin -p web_apollo_admin \
-i WEB_APOLLO_SAMPLE_DIR/split_gff/maker.gff
The default options should be handle GFF3 most files that contain genes, transcripts, and exons.
You can still use this script even if the GFF3 file that you are loading does not contain transcripts and exon types. Let's say we want to load match
and match_part
features as transcripts and exons respectively. We'll use the blastn.gff
file as an example.
$ tools/data/add_transcripts_from_gff3_to_annotations.pl \
-U localhost:8080/WebApollo -u web_apollo_admin -p web_apollo_admin \
-i split_gff/blastn.gff -t match -e match_part
Look at the script's help (-h
) for all available options.
Congratulations, you're done configuring WebApollo!