Copyright (C) 2013 Paul Thomas
This file may be copied and redistributed freely, without advance permission,
provided that this Copyright statement is reproduced with each copy.

LIMITATION OF WARRANTY
NOTHING IN THIS AGREEMENT WILL BE CONSTRUED AS A REPRESENTATION MADE OR
WARRANTY GIVEN BY PAUL THOMAS OR ANY THIRD PARTY THAT THE USE OF DATA PROVIDED
HEREUNDER WILL NOT INFRINGE ANY PATENT, COPYRIGHT, TRADEMARK OR OTHER RIGHTS OF
ANY THIRD PARTY. DATA IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND
WHATSOEVER, EXPRESS OR IMPLIED, INCLUDING IMPLIED WARRANTIES OF MERCHANTABILITY
OR FITNESS FOR A PARTICULAR PURPOSE. PAUL THOMAS MAKES NO WARRANTY THAT ITS DATA
DOES NOT CONTAIN ERRORS.

############################################################################
PANTHER SNP scoring tools - Version 2.01
http://www.pantherdb.org/downloads/
3/20/2011
Updated 10/2/2012
Updated 11/17/2013

##########
This tool estimates the likelihood that a particular nonsynonymous
(amino-acid changing) coding SNP will have a functional impact on the protein.
It reports how long a position in the protein sequence has been preserved,
determined by tracing back through the reconstructed ancestor proteins.

Please refer to:

A list of additional publications is at the bottom of this document.

Additional background documentation can be found at:
http://www.pantherdb.org/tips/tips_csnpScores.jsp

This download contains all the scripts necessary to classify proteins against
the PANTHER library.

##########
Requirements:

PANTHER library of reconstructed ancestor sequences. It is included in this
download. Use the latest version of the "PANTHER_Ancestor.tar.gz" library
(ex: PANTHER8.0_Ancestor.tar.gz). This version of the library contains the
reconstructed ancestor sequences, tree file and rst file of each PANTHER family.

PANTHER sequence library (blast), which contains all sequences available in
our current PANTHER library. It is included in this download.
UNIX

Perl

NCBI BLAST (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
Be sure to install blast+ correctly. Type blastp or ./blastp at the command
line and press Enter; you should not get the message
"blastp: command not found".

The locations of the Perl and BLAST binaries must be defined in your $PATH
variable. If you have any questions on how to set up $PATH, please contact
your UNIX system administrator.

#########
Overview of process:

1. Run blastp with the protein sequence against the PANTHER sequence library
   to find the best-matching PANTHER family.
2. Retrieve the ancestor sequences of the PANTHER family and trace the target
   protein through the evolutionary tree.
3. Report how long, in millions of years, the target position has been
   preserved. Interpret the result.

#########
Usage:

1. Download csnpAnalysis.tar.gz. Copy it into a directory of your choice and
   extract it:
   tar -zxvf csnpAnalysis.tar.gz
2. You'll find the following files:
   blast.tar.gz
   csnpAnalysis_2.01.tar.gz
   PANTHER8.0_Ancestor.tar.gz
   README.txt
   LICENSE
3. Extract csnpAnalysis_2.01.tar.gz:
   tar -zxvf csnpAnalysis_2.01.tar.gz
4. Extract the PANTHER sequence library (blast):
   tar -zxvf blast.tar.gz -C /loc/of/dir/
   (You can extract this library and place it anywhere you want, but you'll
   need to provide its correct location later.)
5. Extract the PANTHER ancestor library:
   tar -zxvf PANTHER8.0_Ancestor.tar.gz
6. Place the PANTHER ancestor library at a location of your choice.
   (You'll need to provide the correct location of this library later.)
7. cd csnpAnalysis_2.01
8. Create the following files:
   a) SNP input file. Give it any name (ex: csnp_input.txt).
      Format (pipe-delimited):
      snpID|proteinID|amino acid position|wild_type_aa;substituted_aa
      ex: rs10000|NP_142523|43|A;V
      (note: the SNP ID and protein ID can be any ID)
      "Mutant amino acid" means the same thing as "substituted amino acid".
      *** The format of the last column is flexible. It does not have to be
      the wild-type amino acid followed by the substituted amino acid.
      The order does not matter; the program will determine which is the
      wild-type amino acid. Nonsense mutations should be indicated by an
      asterisk (*). You can also list more than two amino acids; the list
      must be semicolon-separated.
   b) FASTA file of your protein sequences. Give it any name
      (ex: protein.fasta). The sequence can include any ID. The definition
      line of each sequence must contain only the protein ID.

   **Very important**:
   *** The proteins in this file must be in sync with those in the SNP input
       file.
   *** The amino acid position of each SNP, and the wild-type and substituted
       amino acids at that position, must be in sync with the sequence in the
       FASTA file.
9. Run controller.pl. This script calls 3 other scripts: 1) run_blastp.pl,
   2) parse_blastp.pl and 3) csnp_analysis.pl.
   You should provide the following parameters:
   -f  FASTA file of input sequences
   -i  SNP input file, in the format
       SNPid|proteinid|position|originalAminoAcid;replaceAminoAcid
   -l  PANTHER_Ancestor library
   -s  species. It may be omitted if not known. For human, use -s HUMAN
   -d  blast database. The path looks like /directory/to/file/blast
   -o  output file. If not provided, output goes to ./PANTHER_PSEC.output
   -e  error file (redirected STDERR). If not provided, it goes to ./tmp/err

   A sample:
   perl controller.pl -d /dir/to/database/blast -l /dir/to/lib/PAML_RST -s HUMAN -i .sample.input -f .sample.fasta -o ./tmp/snp.out -e ./tmp/err

   Please note: provide the full path for both the blast database and the
   PANTHER Ancestor library!
10. If you have already run run_blastp.pl and have its output file, you can
    use that output directly by running BlastpExist.pl instead of
    controller.pl. controller.pl would call run_blastp.pl again, and
    run_blastp.pl is very time-consuming for large data sets.
    A sample:
    perl BlastpExist.pl -d /dir/to/database/blast -l /dir/to/lib/PANTHER8.0_Ancestor -s HUMAN -i ./tmp/snp.input -f ./blastout.1099 -o ./tmp/snp.out -e ./tmp/err

#############
Output format and what each column is (comma-separated file):

Substitution: the input SNP substitution, written as the original amino acid,
the position, and the replacement amino acid, like A29P

Result:
(1) "Cannot score substitution" if the sequence you input does not match any
    sequence in our PANTHER library, or if the PANTHER sequence that matches
    the input sequence does not have the same amino acid as the input
    sequence at the selected position.
(2) Age, in millions of years, for which the SNP position has been preserved
(3) "probably damaging" (age >= 380; the score has an FP rate of < approx. 0.2)
(4) "possibly damaging" (age >= 180; the score has an FP rate of < approx. 0.4)
(5) "probably benign" otherwise.

#############
Troubleshooting:

If the above scripts don't work properly, first look at the error file at the
location you provided, or at ./tmp/err. This file gives detailed error
messages. Pay special attention to the locations of the blast database and the
PANTHER_Ancestor library.

Be sure to install blast+ correctly. Type blastp or ./blastp at the command
line and press Enter; you should not get the message
"blastp: command not found".

If you have any questions, please contact us at: feedback@pantherdb.org
You can send the error file as an attachment.

##########
Version History:

Version 2.01 - utilizes the new concept of evolutionary preservation. Traces
through the most likely reconstructed ancestor sequences to find how long a
position has been preserved through evolutionary history.

Version 1.02 - updated the links for blast, HMMER and the Dirichlet mixture
file, and the blast.pm and hmmer.pm modules to handle the new formats from the
blast and hmmer algorithms. 3/20/2011

1.01 - minor update to gracefully handle the rare case where the number of
independent counts is equal to zero.
This occurs when all the positions in the HMM are gapped (delete state), which
indicates that the HMM was not built correctly. The HMM should have been built
so that these positions cannot be aligned to. Also added some documentation on
troubleshooting. 8/1/07

1.0 - first release Sept 29, 2006

If you have any questions, please contact us at: feedback@pantherdb.org

##########
Publications:

???
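
##########
Example: checking the input files before a run (unofficial sketch)

Step 8 of the Usage section requires the SNP input file and the FASTA file to
be in sync. Because blastp can take a long time on large data sets, it may be
worth checking this before running controller.pl. The sketch below is not part
of the PANTHER distribution. It is written in Python purely for illustration,
and the function names read_fasta and check_snp_line are hypothetical. It
assumes that amino acid positions are 1-based, which this README does not
state explicitly, and the protein ID and sequence in the demo are made up.

```python
def read_fasta(text):
    """Parse FASTA text into {protein_id: sequence}.

    The definition line is expected to hold only the protein ID,
    as step 8b requires.
    """
    parts = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
    return {pid: "".join(chunks) for pid, chunks in parts.items()}


def check_snp_line(line, seqs):
    """Return None if the SNP line agrees with the sequences, else a message."""
    fields = line.strip().split("|")
    if len(fields) != 4:
        return "expected 4 pipe-delimited fields"
    _snp_id, protein_id, pos_text, aa_text = fields
    if protein_id not in seqs:
        return "protein ID not found in the FASTA file"
    try:
        pos = int(pos_text)
    except ValueError:
        return "position is not an integer"
    seq = seqs[protein_id]
    # ASSUMPTION: positions are 1-based (position 1 is the first residue).
    if pos < 1 or pos > len(seq):
        return "position is outside the sequence"
    amino_acids = [aa.strip().upper() for aa in aa_text.split(";")]
    if len(amino_acids) < 2:
        return "need at least two semicolon-separated amino acids"
    # The order of the last column is flexible, so the wild-type residue
    # may appear anywhere in the list.
    if seq[pos - 1].upper() not in amino_acids:
        return "no listed amino acid matches the sequence at that position"
    return None


if __name__ == "__main__":
    # Made-up protein ID and sequence, for demonstration only.
    seqs = read_fasta(">NP_EXAMPLE\nMKTAYIAKQR\n")
    for snp in ["rs1|NP_EXAMPLE|4|A;V", "rs2|NP_EXAMPLE|4|G;V"]:
        print(snp, "->", check_snp_line(snp, seqs) or "OK")
```

To use it on real files, read protein.fasta and csnp_input.txt with open()
and pass each non-empty SNP line to check_snp_line. A line that passes this
check can still come back as "Cannot score substitution" if the matching
PANTHER sequence differs at that position.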