Supplementary information

Supplementary material for the manuscript titled:

Identification of putative regulatory upstream ORFs in the yeast genome using heuristics and evolutionary conservation

Authors: Marija Cvijovic, Daniel Dalevi, Elizabeth Bilsland, Graham Kemp, and Per Sunnerhagen

The scripts

Dependencies

The following software needs to be installed:

getorf for finding ORFs. getorf is part of the EMBOSS software suite.
SNAP for calculating synonymous and non-synonymous substitution rates.

ClustalW for producing multiple alignments.
AlignX for inspecting multiple alignments.
ORFviewer, a Java program made specifically for this study.
Expert system. For details see article and references therein.

The following folders should be created:

scripts: Contains all Perl scripts.
data: Contains all raw data files, namely:
1. Sbay_utr5_2000.fasta
2. Scas_utr5_2000.fasta
3. Sklu_utr5_2000.fasta
4. Skud_utr5_2000.fasta
5. Smik_utr5_2000.fasta
6. Spar_utr5_2000.fasta
7. Scer_extended.fasta
8. cer1_inter_bg.txt
9. genes.cer1 (genes in file cer1_inter_bg.txt)
10. genes.ext (genes in Scer_extended.fasta)
11. genes_in_cer_not_ext
12. genes_in_ext_not_cer
results: Empty at the start; will hold the outputs.
root-dir: The root directory, for example the home directory, in which the other directories are located. The file commands.bash should be placed here.
uORF_tmp: Stores temporary files that can be removed after a complete run of all steps.
alignX: Must be emptied at the beginning, see step 9.
newFolder: Contains the alignment of whole S-genomes including coding regions (available as a separate download).
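
The folder layout can be created from the root directory in one go. This is a minimal sketch that assumes only the folder names listed here; the helper name make_uorf_layout and the default location uorf_root are hypothetical and not part of the original scripts.

```shell
# Create the working folders inside a root directory (default: ./uorf_root).
# make_uorf_layout is a hypothetical helper name used for illustration only.
make_uorf_layout() {
  root="${1:-uorf_root}"
  for d in scripts data results uORF_tmp alignX newFolder; do
    mkdir -p "$root/$d" || return 1
  done
}

# Build the layout; set UORF_ROOT to choose another location.
make_uorf_layout "${UORF_ROOT:-uorf_root}"
```

Using mkdir -p makes the call safe to repeat, since existing folders are left untouched.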

The steps

1. Extract a region, for example between 300 and 350 nt, from the file out.ffn (see Steps 10 and 11 on how to create out.ffn), and put the result in the file results.ffn.

perl scripts/splitOutputFromGetorf.pl uORF_tmp/out.ffn results/results.ffn 300 350
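
One possible reading of this step is that it keeps the ORFs whose coordinates fall inside the given window. The sketch below illustrates that reading only; it is not the logic of splitOutputFromGetorf.pl, which is not reproduced here. It assumes getorf FASTA headers of the form ">ID_N [start - end]", and the function name filter_orfs_by_window is hypothetical.

```shell
# Keep FASTA records whose getorf coordinates lie fully inside [lo, hi].
# Assumption: headers look like ">ID_N [start - end] description".
# $1 = getorf FASTA file, $2 = window start, $3 = window end
filter_orfs_by_window() {
  awk -v lo="$2" -v hi="$3" '
    /^>/ {
      keep = 0
      if (match($0, /\[[0-9]+ - [0-9]+\]/)) {
        # Strip the brackets, then split "start - end" into two numbers.
        coords = substr($0, RSTART + 1, RLENGTH - 2)
        split(coords, parts, " - ")
        keep = (parts[1] + 0 >= lo + 0 && parts[2] + 0 <= hi + 0)
      }
    }
    keep { print }
  ' "$1"
}

# Usage (left commented so the sketch has no side effects):
# filter_orfs_by_window uORF_tmp/out.ffn 300 350 > results/window.ffn
```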
2. Run the expert system using the Perl script orf2exp.pl together with the bash file commands.bash and the file rules.txt. All results are stored in the file results.txt.

perl scripts/orf2exp.pl results/results.ffn expert_infile.txt results/results.txt
3. Produce a list of candidates (list_cand.txt) with values above a threshold, which is passed as the argument.

bash scripts/produceList.bash threshold

For example,

bash scripts/produceList.bash 0
4. Produce a priority list for manual inspection of the top candidates, sorted by the maximum CF value of each gene.

perl scripts/producePriorityList.pl results/results.txt results/list_cand.txt results/list_priority

The output file will be: results/list_priority.srt
5. Run getorf on each of the candidates in the list (list_cand.txt) against the UTR genome files.

perl scripts/utr2getorf.pl data/cer2.ffn data/Sbay_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Sbay 10

perl scripts/utr2getorf.pl data/cer2.ffn data/Spar_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Spar 10

perl scripts/utr2getorf.pl data/cer2.ffn data/Smik_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Smik 10

perl scripts/utr2getorf.pl data/cer2.ffn data/Scas_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Scas 10

perl scripts/utr2getorf.pl data/cer2.ffn data/Skud_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Skud 10

perl scripts/utr2getorf.pl data/cer2.ffn data/Sklu_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Sklu 10

The final argument (10) is the number of extra nucleotides to include; any number can be used.
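
The six utr2getorf.pl commands differ only in the species code, so they can be generated in a loop. The sketch below prints the commands rather than running them; the helper name print_utr2getorf_cmds is hypothetical, while the paths and species codes are taken from the commands in this step.

```shell
# Print one utr2getorf.pl command per species; pipe the output to "sh" to execute.
# $1 = number of extra nucleotides to include (defaults to 10, as in this step).
print_utr2getorf_cmds() {
  extra="${1:-10}"
  for sp in Sbay Spar Smik Scas Skud Sklu; do
    printf 'perl scripts/utr2getorf.pl data/cer2.ffn data/%s_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.%s %s\n' \
      "$sp" "$sp" "$extra"
  done
}

print_utr2getorf_cmds 10
```

Printing first makes it easy to check the file names before anything is run, for example with: print_utr2getorf_cmds 10 | sh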
6. For Scer only, use the following script:

perl scripts/out2orfplot.pl uORF_tmp/out.ffn results/toOrfPlot.Scer results/list_cand.txt
7. Run addCF to add the scores from the expert system, creating a new output file named toOrfPlot.Scer_mod. This is done for Scer only.

perl scripts/addCF.pl results/results.txt results/toOrfPlot.Scer
8. Create the file plot.txt (this is the input file for uORFviewer, see Dependencies, which plots uORFs).

cat results/toOrfPlot.Scer_mod results/toOrfPlot.Spar results/toOrfPlot.Smik results/toOrfPlot.Skud results/toOrfPlot.Sbay results/toOrfPlot.Scas results/toOrfPlot.Sklu > results/plot.txt
9. Produce multi-FASTA files for all entries in list_cand.txt. IMPORTANT: this needs the directory alignX, which must be EMPTY; otherwise the program will append to already existing files.

rm alignX/*

perl scripts/utr2multifasta.pl results/list_cand.txt data/cer2.ffn data/Sbay_utr5_2000.fasta data/Spar_utr5_2000.fasta data/Smik_utr5_2000.fasta data/Scas_utr5_2000.fasta data/Skud_utr5_2000.fasta data/Sklu_utr5_2000.fasta
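
Because the multi-FASTA step appends to existing files, it can help to empty alignX and verify that it is empty before running it. A plain "rm alignX/*" reports an error when the folder is already empty and does not remove hidden files. The sketch below is one way around this; the helper names reset_dir and is_empty_dir are hypothetical, and the find options -mindepth and -maxdepth are assumed to be available (GNU and BSD find provide them).

```shell
# Empty a directory of regular files, creating it first if it does not exist.
# The directory itself is left in place.
reset_dir() {
  dir="$1"
  mkdir -p "$dir" || return 1
  find "$dir" -mindepth 1 -maxdepth 1 -type f -exec rm -f {} +
}

# Succeeds only when the directory contains no entries at all.
is_empty_dir() {
  [ -z "$(ls -A "$1")" ]
}

# Usage (left commented so the sketch has no side effects):
# reset_dir alignX && is_empty_dir alignX && echo "alignX is ready"
```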
10. Steps 10 and 11 need to be run only once, to make the file cer2.ffn (which contains approx. 5800 S.cer genes with their corresponding intergenic sequences) and the file out.ffn (which contains uORFs for each S.cer gene).

Produce the file cer2.ffn (this file contains data from the Kellis set and from the Harvard set). First create the file "cer1.ffn":

perl scripts/inter2fasta.pl data/cer1_inter_bg.txt > data/cer1.ffn

To add the extra entries from Scer_extended.fasta to cer1.ffn, we created an extra script called processExtras.pl, which takes a list of gene names (genes_in_ext_not_cer) and the file Scer_extended.fasta. The output is a FASTA file with the extra entries.

perl scripts/processExtras.pl data/Scer_extended.fasta data/genes_in_ext_not_cer.ffn data/genes_in_ext_not_cer

cat data/cer1.ffn data/genes_in_ext_not_cer.ffn | grep \. > data/cer2.ffn
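
In the cat command, the shell removes the backslash from the unquoted \. before grep sees it, so grep receives a single dot, which matches any character. The filter therefore drops empty lines from the concatenated file and keeps everything else. A small demonstration on made-up records (the gene names and the helper name strip_empty_lines are hypothetical):

```shell
# Mirrors the effect of the unquoted "grep \." filter: keep only non-empty lines.
strip_empty_lines() {
  grep .
}

# Two made-up FASTA records separated by a blank line; the blank line is removed.
printf '>geneA\nACGT\n\n>geneB\nTTGA\n' | strip_empty_lines
```

To match a literal dot instead, the pattern would have to be quoted, for example grep '\.'.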
11. Run getorf to create the file out.ffn (this file contains uORFs for each S.cer gene).

getorf -minsize 0 -sequence data/cer2.ffn -outseq uORF_tmp/out.ffn -find 3 -reverse false
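
Since cer2.ffn and out.ffn need to be built only once, a wrapper script can skip the getorf call when out.ffn is already up to date. The sketch below shows one way to decide this; the helper name needs_rebuild is hypothetical, and the -nt file comparison is assumed to be supported by the shell (bash and dash support it).

```shell
# Succeeds when the target is missing, empty, or older than its source file.
# $1 = target file, $2 = source file
needs_rebuild() {
  target="$1"
  source_file="$2"
  [ ! -s "$target" ] || [ "$source_file" -nt "$target" ]
}

# Usage with the getorf command of this step (left commented so the sketch has no side effects):
# if needs_rebuild uORF_tmp/out.ffn data/cer2.ffn; then
#   getorf -minsize 0 -sequence data/cer2.ffn -outseq uORF_tmp/out.ffn -find 3 -reverse false
# fi
```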
12. The software SNAP.pl was used for calculating synonymous and non-synonymous substitution rates.

Last modified: Thu Oct 12 11:19:06 CEST 2006