Supplementary material for the manuscript titled:
Identification of putative regulatory upstream ORFs in the yeast genome using heuristics and evolutionary conservation
Authors:
Marija Cvijovic, Daniel Dalevi, Elizabeth Bilsland, Graham Kemp, and Per Sunnerhagen
The scripts
Dependencies
The following software needs to be installed:
- getorf for finding ORFs. getorf is part of the EMBOSS software suite.
- SNAP for calculating synonymous and non-synonymous substitution rates.
- ClustalW for producing multiple alignments.
- AlignX for inspecting multiple alignments.
- ORFviewer, a Java program written specifically for
this study.
- Expert system. For details see article and
references therein.
The following folders should be created:
- scripts Contains all Perl scripts.
- data Contains all raw data files, namely:
- Sbay_utr5_2000.fasta
- Scas_utr5_2000.fasta
- Sklu_utr5_2000.fasta
- Skud_utr5_2000.fasta
- Smik_utr5_2000.fasta
- Spar_utr5_2000.fasta
- Scer_extended.fasta
- cer1_inter_bg.txt
- genes.cer1 (genes in file cer1_inter_bg.txt)
- genes.ext (genes in Scer_extended.fasta)
- genes_in_cer_not_ext
- genes_in_ext_not_cer
- results Empty at start but will include outputs.
- root-dir The root directory. For example, the homedir where
the other directories are. The file commands.bash should be
put here.
- uORF_tmp Stores temporary files that can be removed after
a complete run of all steps.
- alignX Must be emptied at the beginning; see step 9.
- newFolder Contains alignment of whole S-genomes
including coding regions.
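The folder layout above could be created in one go. The following is a minimal sketch, assuming all folders are siblings directly under the root directory; the function name make_layout and the scratch directory are illustrative and not part of the original scripts.

```shell
# Hypothetical helper: create the folder layout described above
# under a given root directory (defaults to the current directory).
make_layout() {
    root="${1:-.}"
    for d in scripts data results uORF_tmp alignX newFolder; do
        mkdir -p "$root/$d"
    done
}

# Example: build the layout in a scratch directory and list it.
layout_root="$(mktemp -d)"
make_layout "$layout_root"
ls "$layout_root"
```

For a real run, make_layout would be called with the root directory itself, after which the raw data files are copied into data and commands.bash into the root.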
The steps
- Extract a region, for example between 300 and 350 nt, from the file
out.ffn (see steps 10 and 11 on how to create out.ffn), and put the result in the file results.ffn.
perl scripts/splitOutputFromGetorf.pl uORF_tmp/out.ffn results/results.ffn 300 350
- Run the expert system using the Perl script orf2exp.pl together with the bash file
commands.bash and the file rules.txt. All results are stored in the file
results.txt.
perl scripts/orf2exp.pl results/results.ffn expert_infile.txt results/results.txt
- Produce a list of candidates (list_cand.txt) larger than a threshold, which is given as the argument.
bash scripts/produceList.bash threshold
For example,
bash scripts/produceList.bash 0
- Produce a priority list for manual inspection of the highest candidates, sorted according to the maximum CF value of each gene.
perl scripts/producePriorityList.pl results/results.txt results/list_cand.txt results/list_priority
The output file will be: results/list_priority.srt
- Run getorf for each of the candidates in the list
(list_cand.txt) on the UTR genome files.
perl scripts/utr2getorf.pl data/cer2.ffn data/Sbay_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Sbay 10
perl scripts/utr2getorf.pl data/cer2.ffn data/Spar_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Spar 10
perl scripts/utr2getorf.pl data/cer2.ffn data/Smik_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Smik 10
perl scripts/utr2getorf.pl data/cer2.ffn data/Scas_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Scas 10
perl scripts/utr2getorf.pl data/cer2.ffn data/Skud_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Skud 10
perl scripts/utr2getorf.pl data/cer2.ffn data/Sklu_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Sklu 10
10 = number of extra nucleotides to include (can be any number)
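The six calls above differ only in the species prefix, so they could also be generated by a loop. This is a minimal sketch, assuming the file naming shown above; the variable and function names are illustrative, and the sketch only prints each command (drop the echo to actually run them).

```shell
# Species prefixes taken from the data file names listed above.
species="Sbay Spar Smik Scas Skud Sklu"
extra=10   # number of extra nucleotides to include

# Hypothetical helper: print one utr2getorf.pl call per species.
utr2getorf_commands() {
    for sp in $species; do
        # echo makes this a dry run; remove it to execute the script.
        echo perl scripts/utr2getorf.pl data/cer2.ffn \
            "data/${sp}_utr5_2000.fasta" results/list_cand.txt \
            "results/toOrfPlot.${sp}" "$extra"
    done
}

utr2getorf_commands
```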
- For Scer only, use the following script:
perl scripts/out2orfplot.pl uORF_tmp/out.ffn results/toOrfPlot.Scer results/list_cand.txt
- Run addCF to add the scores from the expert system and create a new output
file named toOrfPlot.Scer_mod. THIS IS ONLY DONE FOR SCER.
perl scripts/addCF.pl results/results.txt results/toOrfPlot.Scer
- Create the file plot.txt (this is the input file for uORFviewer, see Dependencies, which plots uORFs).
cat results/toOrfPlot.Scer_mod results/toOrfPlot.Spar results/toOrfPlot.Smik results/toOrfPlot.Skud results/toOrfPlot.Sbay results/toOrfPlot.Scas results/toOrfPlot.Sklu > results/plot.txt
- Produce multi-FASTA files for all entries in list_cand.txt. IMPORTANT: this needs the directory
alignX, which should be EMPTY; otherwise the program will append to
already existing files.
rm alignX/*
perl scripts/utr2multifasta.pl results/list_cand.txt data/cer2.ffn data/Sbay_utr5_2000.fasta data/Spar_utr5_2000.fasta data/Smik_utr5_2000.fasta data/Scas_utr5_2000.fasta data/Skud_utr5_2000.fasta data/Sklu_utr5_2000.fasta
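Because the program appends to existing files, it can help to reset the alignX directory in a way that also works when the directory is missing or already empty. This is a minimal sketch; the function name reset_dir is hypothetical, and the example runs in a scratch directory rather than on real data.

```shell
# Hypothetical helper: make sure a directory exists and is empty.
# Removing and recreating the directory also clears hidden files
# and does not fail when the directory is already empty.
reset_dir() {
    dir="$1"
    # Refuse to run without an argument.
    [ -n "$dir" ] || return 1
    rm -rf "$dir"
    mkdir -p "$dir"
}

# Example in a scratch location: a stale file is removed.
scratch="$(mktemp -d)"
mkdir -p "$scratch/alignX"
touch "$scratch/alignX/old.fasta"
reset_dir "$scratch/alignX"
ls -A "$scratch/alignX"
```

In the pipeline itself, the call would be reset_dir alignX, issued from the root directory before running utr2multifasta.pl.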
- Steps 10 and 11 need to be run only once, to make the file cer2.ffn (this file contains approx. 5800 S.cer genes with corresponding intergenic sequences)
and the file out.ffn (this file contains uORFs for each S.cer gene).
Producing the file cer2.ffn (this file contains data from the Kellis set and from the Harvard set):
First create the file "cer1.ffn".
perl scripts/inter2fasta.pl data/cer1_inter_bg.txt > data/cer1.ffn
To add the extra entries from Scer_extended.fasta to cer1.ffn, we
created an extra script called processExtras.pl, which takes a list
of gene names (genes_in_ext_not_cer) and the file Scer_extended.fasta.
The output is a FASTA file with the extra entries.
perl scripts/processExtras.pl data/Scer_extended.fasta data/genes_in_ext_not_cer.ffn data/genes_in_ext_not_cer
cat data/cer1.ffn data/genes_in_ext_not_cer.ffn | grep \. > data/cer2.ffn
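In the last command, the shell strips the backslash, so grep receives the pattern ".", which matches any line containing at least one character. The effect is that empty lines are dropped from the concatenated file. A small illustration with made-up records (not real data; the function name toy_fasta is hypothetical):

```shell
# Two toy FASTA records separated by blank lines (illustrative only).
toy_fasta() {
    printf '>geneA\nATGAAA\n\n>geneB\nATGCCC\n\n'
}

# Same filter as in the step above: keep only non-empty lines.
toy_fasta | grep \.
```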
- Run getorf to create the file out.ffn (this file contains uORFs for each S.cer gene).
getorf -minsize 0 -sequence data/cer2.ffn -outseq uORF_tmp/out.ffn -find 3 -reverse false
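A quick sanity check on the getorf output is to count the FASTA header lines, which gives the number of sequences written. This is a minimal sketch; count_fasta_records is a hypothetical helper, and the record names in the example are invented for illustration.

```shell
# Hypothetical helper: count records in a FASTA file by counting
# the lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example on a toy file with two made-up records.
tmp_ffn="$(mktemp)"
printf '>toy_1\nATGTAA\n>toy_2\nATGGCCTAA\n' > "$tmp_ffn"
count_fasta_records "$tmp_ffn"
```

On real data the call would be count_fasta_records uORF_tmp/out.ffn.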
- The software SNAP.pl was used for calculating synonymous and
non-synonymous substitution rates.
Last modified: Thu Oct 12 11:19:06 CEST 2006