Copyright (C) 2007 Paul Thomas
This file may be copied and redistributed freely, without advance
permission, provided that this Copyright statement is reproduced with
each copy.

LIMITATION OF WARRANTY
NOTHING IN THIS AGREEMENT WILL BE CONSTRUED AS A REPRESENTATION MADE OR
WARRANTY GIVEN BY PAUL THOMAS OR ANY THIRD PARTY THAT THE USE OF DATA
PROVIDED HEREUNDER WILL NOT INFRINGE ANY PATENT, COPYRIGHT, TRADEMARK OR
OTHER RIGHTS OF ANY THIRD PARTY. DATA IS PROVIDED "AS IS" WITHOUT
WARRANTY OF ANY KIND WHATSOEVER, EXPRESS OR IMPLIED, INCLUDING IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. PAUL
THOMAS MAKES NO WARRANTY THAT ITS DATA DOES NOT CONTAIN ERRORS.

############################################################################
PANTHER SNP scoring tools - Version 1.03
http://www.pantherdb.org/downloads/
2/8/2013
##########

This tool estimates the likelihood that a particular nonsynonymous
(amino-acid changing) coding SNP will have a functional impact on the
protein. It calculates the subPSEC (substitution position-specific
evolutionary conservation) score based on an alignment of evolutionarily
related proteins, as described in:

Paul D. Thomas, et al. 2006. Applications for protein sequence-function
evolution data: mRNA/protein expression analysis and coding SNP scoring
tools. Nucl. Acids Res. 34: W645-W650.

A list of additional publications is at the bottom of this document.

Additional background documentation can be found at:
http://www.pantherdb.org/tips/tips_csnpScores.jsp

This download contains all the scripts necessary to classify proteins
against the PANTHER library.

##########
Requirements:

-PANTHER HMM library, available at the ftp site
 (ftp://ftp.pantherdb.org//panther_library/6.1/).
 Please note that this is an earlier version of the PANTHER library.
 The tool does not work with the current version of the library.
 Sorry for the inconvenience.
-UNIX
-gcc installed on your UNIX machine so that code can be compiled
-Perl
-NCBI BLAST (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
-HMMER (ftp://selab.janelia.org/pub/software/hmmer/2.3.2/hmmer-2.3.2.tar.gz)
 Please note that the downloaded HMMER is an archived version of HMMER2.
 The current release is HMMER3. The PANTHER scoring script does NOT
 support HMMER3.

The locations of the HMMER, Perl, and BLAST binaries must be defined in
your $PATH variable. If you have any questions on how to set up $PATH,
please contact your UNIX system administrator.

#########
Overview of process:

-classify protein sequence against PANTHER to determine which HMM the
 protein best aligns with. This will generate an output file for the
 following step.
-generate subPSEC score for each SNP

##########
Usage:

-Setup:
1. Download PANTHER HMM library. You must download the version of the
   library specifically for SNP analysis called: PANTHER6.1_snp.tgz
   The library is available at the ftp site
   (ftp://ftp.pantherdb.org/panther_library) or on the downloads page
   (http://www.pantherdb.org/downloads/index.jsp).
2. Uncompress panther library:
   gzip -dc PANTHER6.1_snp.tgz | tar -xvf -
3. Place the PANTHER library in the csnpAnalysis1.03 directory.
4. cd csnpAnalysis1.03
5. Make seq_wt program:
   a. % cd seq_wt_dir
   b. % make             (compile program)
   c. % ./seq_wt         (to test that seq_wt was compiled correctly)
   d. % cp seq_wt ../    (copy the seq_wt binary one directory back)
   e. % cd ../           (go back one directory)
6. Download the BLOSUM62 matrix file from NCBI and place it in this
   directory. The file is at:
   ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62
7. Download the uprior.9comp 9-component Dirichlet mixture file from UCSC,
   and place the file in this directory. The file is at:
   http://compbio.soe.ucsc.edu/dirichlets/uprior.9comp
8. Create the following files
   a.
      SNP input file - give it any name (ex: csnp_input.txt)
      format (pipe-delimited):
        snpID|proteinID|amino acid position|wild_type_aa;substituted_aa
      ex: rs10000|NP_142523|43|A;V
      (note: the SNP ID and protein ID can be any ID)
      "mutant amino acid" means the same thing as "substituted amino acid"
      *** The format of the last column is flexible. It does not have to
          be the wild type amino acid followed by the substituted amino
          acid. The order does not matter, and the program will determine
          which is the wild type amino acid. Nonsense mutations should be
          indicated by an asterisk (*). You can also list more than two
          amino acids - the list must be semicolon separated.
   b. Fasta file of your protein sequences. Give it any name
      (ex: protein.fasta). The sequence can include any ID. The
      definition line for the sequences must only contain the protein ID.
      **Very important**:
      *** The proteins in this file must be in sync with what is in the
          SNP input file.
      *** The amino acid position in the SNP, and the wild type and
          substituted amino acids at that position, must be in sync with
          the sequence in the fasta file.
9. source panther.cshrc
10. Classify protein sequences against PANTHER
    % ./pantherScore.pl -l -D B -V -i -o -n -T tmp/
    ex: ./pantherScore.pl -l PANTHER6.1 -D B -V -i test.fasta -o scores.out -n -T tmp/
    (test.fasta, a sample test fasta file, has been provided in this
    directory)
    If the fasta file is large, we suggest splitting the fasta file into
    several different files, and analyzing each file separately.
    Once all fasta files have been analyzed
11. Classify SNPs
    % perl snp_analysis.pl -l -c -s -f -b BLOSUM62 -V -p uprior.9comp -o -T tmp/
    ex: perl snp_analysis.pl -l PANTHER6.1 -c scores.out -s test_csnpInput.txt -f test.fasta -b BLOSUM62 -V -p uprior.9comp -o snpanalysis.out -T tmp/
    If a SNP maps to multiple proteins, by default, the program will take
    the protein with the best HMM score. Use the -a option if you want the
    results for all proteins.
    ex: perl snp_analysis.pl -l PANTHER6.1 -c scores.out -s test_csnpInput.txt -f test.fasta -b BLOSUM62 -V -p uprior.9comp -o snpanalysis.out -T tmp/ -a

##########
Output format, and what each column is (tab-separated file):

snpId
seqId
subPSEC - score that estimates the likelihood of a functional effect from
    a single amino acid substitution.
Pdeleterious - the probability of a nsSNP being deleterious
wtAA - amino acid in protein
aaPos - position of SNP in protein
subAA - the substituted amino acid
HMM ID - The input protein sequence is scored against the HMMs in the
    PANTHER library. The alignment to the HMM with the most significant
    score is used for the analysis. This can be a subfamily (indicated
    with :SF), or a family.
HMM name - the name of the HMM
HMM Score - the score of the protein against the HMM
hmmPos - The protein sequence is aligned to the HMM. The hmmPos indicates
    the position in the HMM that the SNP position corresponds to, once the
    protein sequence is aligned to the HMM.
msaPos - Different subfamily HMMs in a particular family could have
    different lengths. The msaPos is the position in the multiple sequence
    alignment that the SNP position aligns to. The msa position is what is
    displayed on output for the online version of this tool, since the
    msaPos corresponds to the MSA shown in the tree viewing software. So,
    this position is useful if you want to look at the msa/tree on the
    PANTHER website.
sfConsAA - tells you which amino acid (if any) is 100% conserved in the
    subfamily MSA at the specified position
Pwt - probability of wild type amino acid
Psub - probability of substituted/mutant amino acid
evolSFs - the list of SFs used to derive the probabilities of the wild
    type and substituted amino acids
evolFam - "Y" if the entire family is used to derive the probabilities.
NIC - NIC (number of independent counts) is an estimate of the number of
    independent observations used to calculate the amino acid
    probabilities.
    The probabilities are calculated from a combination of prior knowledge
    (e.g. that isoleucine often substitutes for valine) and observations,
    so the larger the NIC, the more the probabilities rely on the amino
    acids observed in the multiple sequence alignment.
blosum - BLOSUM score for the substitution. This can be useful if a
    subPSEC score could not be generated.
message - a message column, as described below in the Message Column
    sections

###########
Message Column (non-errors):

-"No PANTHER HMM hit" - this means that the input protein did not score
 against an HMM with a score better than 1e-3
-"Weak PANTHER score" - this means that the input protein did not score
 against an HMM with a score better than 1e-23. Proteins that scored
 greater than 1e-23 are excluded from the analysis, as the alignments are
 less reliable.
-"SNP position in protein does not align to HMM" - If the substitution
 occurs at a position that does not appear in the multiple sequence
 alignment, a subPSEC score cannot be generated. The message indicates
 that the substitution occurs at a position that is inserted relative to
 the consensus HMM for the given family. In most cases, these positions
 are not modeled by the HMMs simply because they do not appear in most of
 the related sequences; as a result, substitutions at inserted positions
 are not generally likely to be deleterious.
-"wild type amino acid is a stop codon" - the amino acid in the protein at
 the position specified is a stop codon
-"silent mutation" - no score is given for silent mutations
-"nonsense mutation" - no score is given for nonsense mutations

Message Column (errors):

-"wild type and substituted amino acids required" - this indicates that
 you did not give amino acids, or you did not give two amino acids
-"invalid amino acid" - you did not give one of the 20 amino acids
-"Missing sequence" - the protein is not in the fasta file
-"SNP position not within protein" - the position of the SNP does not
 exist in the protein.
-"wild type amino acid is ..." - this means none of the input amino acids
 match the amino acid in the protein
-"substitution position incorrect" - the position given for the protein is
 incorrect

#########
Interpretation of results:

Please see publications at the bottom of this document.

How to interpret the score:
-The subPSEC (substitution position-specific evolutionary conservation)
 score estimates the likelihood of a functional effect from a single amino
 acid substitution. It is the negative logarithm of the probability ratio
 of the wild-type and mutant amino acids at a particular position.
 PANTHER subPSEC scores are continuous values from 0 (neutral) to about
 -10 (most likely to be deleterious). -3 is the previously identified
 cutoff point for functional significance. A cutoff of -3 corresponds to
 a 50% probability that a substitution is deleterious. From this, the
 probability that a given variant will cause a deleterious effect on
 protein function is estimated by Pdeleterious, such that a subPSEC score
 of -3 corresponds to a Pdeleterious of 0.5.
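
The relationship between the two scores can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the code
used by snp_analysis.pl. The logistic form
Pdeleterious = 1 - 1/(1 + exp(-subPSEC - 3)) is an assumption: this README
only states that a subPSEC of -3 maps to a Pdeleterious of 0.5, so check the
publications listed at the bottom before relying on it. Taking the absolute
value of the log ratio (so that the score is never positive) is likewise an
assumption, and both function names are hypothetical.

```python
import math


def subpsec_from_probabilities(p_wt, p_sub):
    """Negative absolute log ratio of the wild-type and substituted
    amino acid probabilities (assumed form, see the text above)."""
    if p_wt <= 0 or p_sub <= 0:
        raise ValueError("probabilities must be positive")
    return -abs(math.log(p_wt / p_sub))


def p_deleterious(subpsec):
    """Assumed logistic mapping in which a subPSEC of -3 gives 0.5."""
    return 1.0 - 1.0 / (1.0 + math.exp(-subpsec - 3.0))


if __name__ == "__main__":
    # Print the mapping at the neutral end, the cutoff, and the extreme end.
    for score in (0.0, -3.0, -10.0):
        print(f"subPSEC {score:6.1f} -> Pdeleterious {p_deleterious(score):.3f}")
```

Under these assumptions, scores near 0 give a Pdeleterious close to 0 and
scores near -10 give a value close to 1, which matches the range described
above.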

#########
Troubleshooting:

If you have any questions, please contact us at: feedback@pantherdb.org

-If there is an error saying that perl cannot find the Math or BigInt
 modules, you will need to download them from CPAN:
 http://www.cpan.org/authors/id/T/TE/TELS/math/Math-BigInt-1.77.tar.gz
 After downloading this file, uncompress it, and place
 Math-BigInt-1.77/lib/Math/ into the lib/ directory of this directory
 (pantherScore1.01_csnp). This math module seems to be part of the default
 installation of the latest version of Linux, but we don't know if it is
 always installed.

##########
Troubleshooting the HMM scoring portion of the program:

Ultimately, if you have any problems, please contact us at:
feedback@pantherdb.org
But before you do that, it would be helpful if you could try each of the
following commands:

% hmmsearch PANTHER6.1/books/PTHR19264/hmmer.hmm test.fasta
% hmmsearch -Z 10000 PANTHER6.1/books/PTHR19264/hmmer.hmm test.fasta
% hmmsearch --cpu 1 PANTHER6.1/books/PTHR19264/hmmer.hmm test.fasta

All of these commands should run properly and generate results. If they
do not, this means that you have a problem with the way you installed
hmmsearch, or more likely, you have a problem with how you compiled HMMER.
In particular, if you have problems with the --cpu option (in which case
you might see a POSIX or threads error), you should download and recompile
HMMER so that it works properly with threads. HMMER can be downloaded from
http://hmmer.wustl.edu/.

If you send an email to PANTHER feedback, please tell us the command you
are using, and send us the fasta file you are using.

##########
Version History:

version 1.03 - fix bugs. Redirect to PANTHER6.1 for the scoring.
    2/8/2013
version 1.02 - update the links for BLAST, HMMER and the Dirichlet mixture
    file, and the blast.pm and hmmer.pm modules to handle the new formats
    from the BLAST and HMMER programs.
    3/20/2011
1.01 - minor update to gracefully handle the rare case where the number of
    independent counts is equal to zero. This occurs when all the
    positions in the HMM are gapped (delete state), but the HMM was not
    built correctly. The HMM should have been built so that these
    positions of the HMM cannot be aligned to. Also, added some
    documentation on troubleshooting.
    8/1/07
1.0 - first release
    Sept 29, 2006

If you have any questions, please contact us at: feedback@pantherdb.org

##########
Publications:

Paul D. Thomas, et al. 2006. Applications for protein sequence-function
evolution data: mRNA/protein expression analysis and coding SNP scoring
tools. Nucl. Acids Res. 34: W645-W650.

Liam R. Brunham, Roshni R. Singaraja, Terry D. Pape, Anish Kejariwal,
Paul D. Thomas, Michael R. Hayden. 2005. Accurate Prediction of the
Functional Significance of Single Nucleotide Polymorphisms and Mutations
in the ABCA1 Gene. PLoS Genet., 1(6): e83.

Paul D. Thomas and Anish Kejariwal. 2004. Coding single-nucleotide
polymorphisms associated with complex vs. Mendelian disease: Evolutionary
evidence for differences in molecular effects. Proc. Natl. Acad. Sci.,
101(43): 15398-403.

Paul D. Thomas, Michael J. Campbell, Anish Kejariwal, Huaiyu Mi, Brian
Karlak, Robin Daverman, Karen Diemer, Anushya Muruganujan, Apurva
Narechania. 2003. PANTHER: a library of protein families and subfamilies
indexed by function. Genome Res., 13: 2129-2141.