Supplementary material for the manuscript titled:

Identification of putative regulatory upstream ORFs in the yeast genome using heuristics and evolutionary conservation

Authors: Marija Cvijovic, Daniel Dalevi, Elizabeth Bilsland, Graham Kemp, and Per Sunnerhagen



The scripts

Dependencies

The following software needs to be installed:
  1. getorf for finding ORFs. getorf is part of the EMBOSS software package.
  2. SNAP for calculating synonymous and non-synonymous substitution rates.
  3. ClustalW for producing multiple alignments.
  4. AlignX for inspecting multiple alignments.
  5. ORFviewer, a Java program written specifically for this study.
  6. Expert system. For details see article and references therein.
The following folders should be created:
  1. scripts Contains all Perl scripts.
  2. data Contains all raw data files, that is:
    1. Sbay_utr5_2000.fasta
    2. Scas_utr5_2000.fasta
    3. Sklu_utr5_2000.fasta
    4. Skud_utr5_2000.fasta
    5. Smik_utr5_2000.fasta
    6. Spar_utr5_2000.fasta
    7. Scer_extended.fasta
    8. cer1_inter_bg.txt
    9. genes.cer1 (genes in file cer1_inter_bg.txt)
    10. genes.ext (genes in Scer_extended.fasta)
    11. genes_in_cer_not_ext
    12. genes_in_ext_not_cer
  3. results Empty at start but will include outputs.
  4. root-dir The root directory, for example the home directory in which the other directories are located. The file commands.bash should be placed here.
  5. uORF_tmp Stores temporary files that can be removed after a complete run of all steps.
  6. alignX Must be emptied before each run; see step 9.
  7. newFolder Contains alignment of whole S-genomes including coding regions (provided as a separate download).
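The folder layout above can be created in one go. The following is a minimal sketch, not part of the original scripts: the variable ROOT and its default location (uorf_root under the current directory) are assumptions made for illustration, and "root-dir" is treated as the root itself rather than as a subfolder.

```shell
# Hypothetical root location; export ROOT beforehand to use another directory.
ROOT=${ROOT:-$PWD/uorf_root}

# Create the working folders named in the list above (no error if they exist).
for d in scripts data results uORF_tmp alignX newFolder; do
    mkdir -p "$ROOT/$d"
done

# Show the resulting layout.
ls "$ROOT"
```

The Perl scripts, the raw data files and commands.bash still have to be copied into scripts, data and the root directory by hand.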

The steps

  1. Subtract a region, for example between 300 and 350 nt, from the file out.ffn (see Steps 10 and 11 on how to create out.ffn), and put the result in the file results.ffn.

    perl scripts/splitOutputFromGetorf.pl uORF_tmp/out.ffn results/results.ffn 300 350

  2. Run the expert system using the Perl script orf2exp.pl together with the bash file commands.bash and the file rules.txt. All results are stored in the file results.txt.

    perl scripts/orf2exp.pl results/results.ffn expert_infile.txt results/results.txt

  3. Produce a list of candidates (list_cand.txt) larger than the threshold, which is given as the argument.

    bash scripts/produceList.bash threshold

    For example,

    bash scripts/produceList.bash 0

  4. Produce a priority list for manual inspection of the highest candidates, sorted according to the maximum CF-value of each gene.

    perl scripts/producePriorityList.pl results/results.txt results/list_cand.txt results/list_priority

    The output file will be results/list_priority.srt

  5. Run getorf on each of the candidates in the list (list_cand.txt) for each of the UTR genome files.

    perl scripts/utr2getorf.pl data/cer2.ffn data/Sbay_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Sbay 10

    perl scripts/utr2getorf.pl data/cer2.ffn data/Spar_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Spar 10

    perl scripts/utr2getorf.pl data/cer2.ffn data/Smik_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Smik 10

    perl scripts/utr2getorf.pl data/cer2.ffn data/Scas_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Scas 10

    perl scripts/utr2getorf.pl data/cer2.ffn data/Skud_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Skud 10

    perl scripts/utr2getorf.pl data/cer2.ffn data/Sklu_utr5_2000.fasta results/list_cand.txt results/toOrfPlot.Sklu 10

    10 = number of extra nucleotides to include (can be any number)
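The six calls in step 5 differ only in the species prefix, so they can be sketched as a loop. This is a minimal sketch, not part of the original scripts: the function name run_utr2getorf_all and the variable RUN are hypothetical. By default the sketch only prints the commands (dry run); exporting RUN with an empty value beforehand makes it execute them, which assumes that scripts/utr2getorf.pl and the data files are in place.

```shell
# Dry run by default: each command is echoed instead of executed.
# Export RUN="" (empty) before running to execute the commands for real.
RUN=${RUN-echo}

run_utr2getorf_all() {
    extra=${1:-10}   # number of extra nucleotides to include
    # Species prefixes taken from the UTR file names in the data folder.
    for sp in Sbay Spar Smik Scas Skud Sklu; do
        $RUN perl scripts/utr2getorf.pl data/cer2.ffn \
            "data/${sp}_utr5_2000.fasta" results/list_cand.txt \
            "results/toOrfPlot.${sp}" "$extra"
    done
}

run_utr2getorf_all 10
```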

  6. For Scer only, use the following script:

    perl scripts/out2orfplot.pl uORF_tmp/out.ffn results/toOrfPlot.Scer results/list_cand.txt

  7. Run addCF.pl to add the scores from the expert system and create a new output file named toOrfPlot.Scer_mod. THIS IS ONLY DONE FOR SCER.

    perl scripts/addCF.pl results/results.txt results/toOrfPlot.Scer

  8. Create the file plot.txt (this is the input file for uORFviewer, see Dependencies, which plots the uORFs).

    cat results/toOrfPlot.Scer_mod results/toOrfPlot.Spar results/toOrfPlot.Smik results/toOrfPlot.Skud results/toOrfPlot.Sbay results/toOrfPlot.Scas results/toOrfPlot.Sklu > results/plot.txt
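Because cat stops with an error when an input file is missing, it can help to check that all seven files exist before plot.txt is written. The following is a minimal sketch under the assumption that all toOrfPlot files from steps 5 to 7 are in the same folder; the function name build_plot_input is hypothetical and not part of the original scripts.

```shell
build_plot_input() {
    dir=$1
    set --   # reuse the positional parameters as the list of input files
    # Same order as in the cat command of step 8.
    for sp in Scer_mod Spar Smik Skud Sbay Scas Sklu; do
        f="$dir/toOrfPlot.$sp"
        if [ ! -f "$f" ]; then
            echo "missing input: $f" >&2
            return 1
        fi
        set -- "$@" "$f"
    done
    # All inputs are present: concatenate them into plot.txt.
    cat "$@" > "$dir/plot.txt"
}
```

From the root directory it would be called as: build_plot_input results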

  9. Produce multifasta files for all entries in list_cand.txt. IMPORTANT: this step needs the directory alignX, which must be EMPTY; otherwise the program will append to already existing files.

    rm alignX/*

    perl scripts/utr2multifasta.pl results/list_cand.txt data/cer2.ffn data/Sbay_utr5_2000.fasta data/Spar_utr5_2000.fasta data/Smik_utr5_2000.fasta data/Scas_utr5_2000.fasta data/Skud_utr5_2000.fasta data/Sklu_utr5_2000.fasta
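Since step 9 appends to existing files, the emptying of alignX can be made slightly more defensive than a plain rm. The following is a minimal sketch, not part of the original scripts; the function name empty_dir is hypothetical. It creates the directory if needed, removes the files in it and reports an error if anything is left (for example a hidden file or a subdirectory).

```shell
empty_dir() {
    # Abort if no directory name is given, so that rm never sees "/*".
    dir=${1:?usage: empty_dir DIR}
    mkdir -p "$dir"
    rm -f "$dir"/*
    # ls -A also lists hidden entries; any output means the directory is not empty.
    if [ -n "$(ls -A "$dir")" ]; then
        echo "$dir still contains entries" >&2
        return 1
    fi
}

empty_dir alignX
```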

  10. Steps 10 and 11 need to be run only once, to make the file cer2.ffn (this file contains approx. 5800 S.cer genes with corresponding intergenic sequences) and the file out.ffn (this file contains uORFs for each S.cer gene).

    Producing the file cer2.ffn (this file contains data from the Kellis set and from the Harvard set). First create the file "cer1.ffn":

    perl scripts/inter2fasta.pl data/cer1_inter_bg.txt > data/cer1.ffn

    To add the extra entries from Scer_extended.fasta to cer1.ffn, we created an extra script called processExtras.pl, which takes a list of gene names (genes_in_ext_not_cer) and the file Scer_extended.fasta. The output is a FASTA file with the extra entries.

    perl scripts/processExtras.pl data/Scer_extended.fasta data/genes_in_ext_not_cer.ffn data/genes_in_ext_not_cer

    cat data/cer1.ffn data/genes_in_ext_not_cer.ffn | grep \. > data/cer2.ffn
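A note on the grep call in the last command: because the backslash is not quoted, the shell passes the pattern to grep as a single dot, which matches any character. The filter therefore keeps every non-empty line and appears to serve to drop empty lines from the merged file. The following toy example illustrates this behaviour; the file names and the entries geneA and geneB are made up for illustration.

```shell
tmp=$(mktemp -d)

# Two small FASTA files with stray empty lines.
printf '>geneA\nATGC\n\n' > "$tmp/a.ffn"
printf '\n>geneB\nGGCC\n' > "$tmp/b.ffn"

# Same filter as above, written without the backslash: keep non-empty lines only.
cat "$tmp/a.ffn" "$tmp/b.ffn" | grep . > "$tmp/merged.ffn"

cat "$tmp/merged.ffn"
```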

  11. Run getorf to create the file out.ffn (this file contains uORFs for each S.cer gene).

    getorf -minsize 0 -sequence data/cer2.ffn -outseq uORF_tmp/out.ffn -find 3 -reverse false
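Step 10 states that cer2.ffn holds approximately 5800 S.cer genes, so counting the FASTA header lines is a quick sanity check after steps 10 and 11. The following is a minimal sketch, not part of the original scripts; the function name count_fasta_records is hypothetical.

```shell
count_fasta_records() {
    # Every FASTA record starts with one header line beginning with '>'.
    grep -c '^>' "$1"
}

for f in data/cer2.ffn uORF_tmp/out.ffn; do
    if [ -f "$f" ]; then
        echo "$f: $(count_fasta_records "$f") records"
    else
        echo "$f: not found (run steps 10 and 11 first)"
    fi
done
```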

  12. The software SNAP.pl was used for calculating synonymous and non-synonymous substitution rates.

Last modified: Thu Oct 12 11:19:06 CEST 2006