Copyright (C) 2016 Paul Thomas

This file may be copied and redistributed freely, without advance
permission, provided that this Copyright statement is reproduced with
each copy.

LIMITATION OF WARRANTY

NOTHING IN THIS AGREEMENT WILL BE CONSTRUED AS A REPRESENTATION MADE OR
WARRANTY GIVEN BY PAUL THOMAS OR ANY THIRD PARTY THAT THE USE OF DATA
PROVIDED HEREUNDER WILL NOT INFRINGE ANY PATENT, COPYRIGHT, TRADEMARK OR
OTHER RIGHTS OF ANY THIRD PARTY. DATA IS PROVIDED "AS IS" WITHOUT
WARRANTY OF ANY KIND WHATSOEVER, EXPRESS OR IMPLIED, INCLUDING IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. PAUL
THOMAS MAKES NO WARRANTY THAT ITS DATA DOES NOT CONTAIN ERRORS.

############################################################################

PANTHER-PSEP SNP scoring tool - Version 1.01
http://www.pantherdb.org/downloads/

Updated 09/25/2015

Precalculated results for PANTHER10 are now available:
'PANTHER10_precomputed.txt'

The format for this file is:
  1st column: species name
  2nd column: UniprotKB id
  3rd column: amino acid location in protein sequence
  4th column: the amino acid in the corresponding location
  5th column: the common ancestor at the root of corresponding PANTHER tree
  6th column: the common ancestor to which the amino acid is preserved
  7th column: the main outcome: preservation time in millions of years
              (it may be missing if age information is missing for the
              common ancestor in the 6th column)

Updated 10/7/2016

##########

This tool estimates the likelihood that a particular nonsynonymous
(amino-acid changing) coding SNP causes a functional impact on the
protein. It reports how long a position in the protein sequence has been
preserved by tracing back through its reconstructed ancestor proteins.

In publications that utilize this tool, please reference:
Tang & Thomas, PANTHER-PSEP: predicting disease-causing genetic variants
using position-specific evolutionary preservation, Bioinformatics: btw222 (2016).
PMID: 27193693

Additional background documentation can be found at:
http://www.pantherdb.org/tips/tips_csnpScores.jsp

This download contains all the scripts necessary to classify proteins
against the PANTHER library.

##########
Requirements:

PANTHER library of reconstructed ancestor sequences. It is available in
this zip file. Download the latest version of the
"PANTHER_Ancestor.tar.gz" library (ex: PANTHER9.0_Ancestor.tar.gz). This
version of the library contains the reconstructed ancestor sequences,
tree file and rst file of each PANTHER family.

PANTHER sequences library (blast), which contains all sequences
available in our current PANTHER library. It is available in this zip
file.

UNIX

Perl

NCBI BLAST (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
Be sure to install blast+ correctly. Type blastp or ./blastp at the
command line and press Enter; you should not get the message
"blastp: command not found".

The locations of the Perl and BLAST binaries must be defined in your
$PATH variable. If you have any questions on how to set up $PATH, please
contact your UNIX system administrator.

#########
Overview of process:

1. Blastp the protein sequence against the PANTHER sequence library to
   find the best matching PANTHER family.
2. Retrieve the ancestor sequences of the PANTHER family and trace the
   target protein through the evolutionary tree.
3. Report how long, in millions of years, the target position has been
   preserved. Interpret the result.

#########
Usage:

1. Download PANTHER_PSEP1.01.tar.gz. Copy it into a directory of your
   choice and extract it:
     tar -zxvf PANTHER_PSEP1.01.tar.gz
2. You will find the following files:
     PANTHER9.blast.tar.gz
     scripts.tar.gz
     PANTHER9.PAML_RST.tar.gz
     README.txt
     LICENSE
3. Extract scripts.tar.gz:
     tar -zxvf scripts.tar.gz
4. Extract the PANTHER blast library:
     tar -zxvf PANTHER9.blast.tar.gz -C /loc/of/dir/
   (You can extract this library and place it anywhere you want, but
   then you'll need to provide the correct location of this library
   later.)
5. Extract the PANTHER ancestor library:
     tar -zxvf PANTHER9.PAML_RST.tar.gz -C /loc/of/dir/
6. Place the PANTHER ancestor library at a location of your choice.
   (You'll need to provide the correct location of this library later.)
7. cd scripts/
8. Create the following files:
   a) SNP input file. Give it any name (ex: csnp_input.txt).
      Format (pipe-delimited):
        snpID|proteinID|amino acid position|wild_type_aa;substituted_aa
      ex: rs10000|NP_142523|43|A;V
      (Note: the SNP ID and protein ID can be any ID. "Mutant amino
      acid" means the same thing as "substituted amino acid".)
      *** The format of the last column is flexible. It does not have to
      list the wild type amino acid first and the substituted amino acid
      second. The order does not matter, and the program will determine
      which is the wild type amino acid. Nonsense mutations should be
      indicated by an asterisk (*). You can also have more than two
      amino acids; the list must be semicolon separated.
   b) Fasta file of your protein sequences. Give it any name
      (ex: protein.fasta). The sequence can include any ID. The
      definition line for each sequence must contain only the protein
      ID.
      **Very important**:
      *** The proteins in this file must be in sync with what is in the
          SNP input file.
      *** The amino acid position in the SNP, and the wild type and
          substituted amino acids at that position, must be in sync with
          the sequence in the fasta file.
9. Run controller.pl. This script calls 3 other scripts:
     1) run_blastp.pl
     2) parse_blastp.pl
     3) PANTHER_PSEP1.01.pl
   You should use the following parameters:
     -f  for Fasta file of input sequences
     -i  for SNP input file, which has the format
         SNPid|proteinid|position|originalAminoAcid;replaceAminoAcid
     -l  for PANTHER_Ancestor library
     -s  for species; it may be omitted if not known.
         For human, use -s HUMAN
     -d  for blast database; the correct address looks like
         /directory/to/file/blast
     -o  for output file; if not provided, it will be
         ./PANTHER_PSEC.output
     -e  for error file (redirected STDERR); if not provided, it will
         be ./tmp/err
   A sample:
     perl controller.pl -d /dir/to/database/blast \
       -l /dir/to/lib/PANTHER9_PAML_RST -s HUMAN \
       -i sample.input -f sample.fasta \
       -o ./tmp/snp.out -e ./tmp/err
   Please note: provide the full paths for the blast database and the
   PANTHER Ancestor library!
10. If run_blastp.pl has already finished and produced an output file,
    you can use that file directly by running BlastpExist.pl instead of
    controller.pl. controller.pl would call run_blastp.pl again, and
    run_blastp.pl is very time consuming for large data sets.
    A sample:
      perl BlastpExist.pl -d /dir/to/database/blast \
        -l /dir/to/lib/PANTHER9_PAML_RST -s HUMAN \
        -i ./tmp/snp.input -f .output/file/of/run_blastp.pl \
        -o ./tmp/snp.out -e ./tmp/err

#############
Output format and what each column is (comma separated file)

Substitution: the input SNP substitution in the format
  originalAminoAcid, position, replacedAminoAcid (like A29P)

Result:
  (1) "Cannot score substitution" if the sequence you input does not
      match any sequence in our PANTHER library, or if the PANTHER
      sequence that matches the input sequence doesn't have the same
      amino acid as the input sequence at the selected position.
  (2) Age in millions of years that a SNP position has been preserved
  (3) "probably damaging" (if score has an FP rate below approx. 0.2)
      (age>=380)
  (4) "possibly damaging" (if score has an FP rate below approx. 0.4)
      (age>=180)
  (5) "probably benign" otherwise.

#############
Troubleshooting:

If the above scripts don't work properly, first view the error messages
at the location you provided with -e, or at ./tmp/err. This err file
gives detailed error messages. Pay special attention to the locations of
the blast database and the PANTHER_Ancestor library.

Be sure to install blast+ correctly.
Type blastp or ./blastp at the command line and press Enter; you should
not get the message "blastp: command not found".

If you have any questions, please contact us at: feedback@pantherdb.org
You can send the err file as an attachment.

##########
Version History:

Version 1.01 - utilizes the new concept of evolutionary preservation.
Traces through the most likely reconstructed ancestor sequences to find
how long a position has been preserved through evolutionary history.

##########
Publications:

Tang & Thomas, PANTHER-PSEP: predicting disease-causing genetic variants
using position-specific evolutionary preservation, Bioinformatics: btw222 (2016).
PMID: 27193693
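
##########
Checking input files before a run (optional sketch):

Usage step 8 requires the SNP input file and the fasta file to be in
sync. The following is a minimal sketch of such a check, not part of the
PANTHER-PSEP scripts. It is written in Python, which the tool itself
does not require. The function names read_fasta and check_snp_line are
hypothetical. It assumes that positions are 1-based and that the residue
found in the fasta sequence at the given position must be one of the
amino acids listed in the last column; neither assumption is stated
explicitly in this README.

```python
import io


def read_fasta(handle):
    """Return {protein_id: sequence}; the ID is the first word after '>'."""
    chunks = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {pid: "".join(parts) for pid, parts in chunks.items()}


def check_snp_line(line, seqs):
    """Return None if the SNP line agrees with the fasta data, else a message."""
    fields = line.strip().split("|")
    if len(fields) != 4:
        return "expected 4 pipe-delimited fields"
    snp_id, protein_id, pos_text, aa_text = fields
    if protein_id not in seqs:
        return f"{snp_id}: protein {protein_id} is not in the fasta file"
    try:
        pos = int(pos_text)
    except ValueError:
        return f"{snp_id}: position {pos_text!r} is not an integer"
    seq = seqs[protein_id]
    # Assumption: positions are 1-based.
    if pos < 1 or pos > len(seq):
        return f"{snp_id}: position {pos} is outside the sequence (length {len(seq)})"
    # Assumption: the sequence residue must appear in the semicolon-separated list.
    listed = [aa.strip().upper() for aa in aa_text.split(";")]
    found = seq[pos - 1].upper()
    if found not in listed:
        return f"{snp_id}: sequence has {found} at {pos}, not one of {listed}"
    return None


if __name__ == "__main__":
    # Tiny in-memory example; real use would open the two input files instead.
    demo_seqs = read_fasta(io.StringIO(">NP_1\nMKTA\nYIAK\n"))
    for snp in ["rs1|NP_1|4|A;V", "rs2|NP_1|4|G;V"]:
        problem = check_snp_line(snp, demo_seqs)
        print(snp, "->", "ok" if problem is None else problem)
```

Lines for which check_snp_line returns a message are candidates for the
"Cannot score substitution" result and are worth fixing before running
controller.pl.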