------------------------------------------------------------------------------
Command Line for Running SAVI (Semi-Automated Validation Infrastructure) 3.1
------------------------------------------------------------------------------

SAVI is a tool created and maintained by the Plant Metabolic Network
(https://plantcyc.org) for use in our pipeline for creating pathway genome
databases (PGDBs). SAVI applies our previous curation decisions about
metabolic pathways to new species databases and to updated versions of
existing databases. It relies on previous manual curation by the curators at
PMN and is intended for species within Viridiplantae (land plants and green
algae). It will not produce correct or useful results for species outside of
that clade.

Two arguments are required for this program.
The first argument is the input folder of SAVI, for example ./input/arabidopsis/
The second argument is the output folder of SAVI, for example ./output/arabidopsis/

Example:
    runSAVI3.1.sh ./input/arabidopsis/ ./output/arabidopsis/ (Linux, Mac)
    runSAVI3.1.bat ./input/arabidopsis/ ./output/arabidopsis/ (Windows)

The program needs seven pathway library files, two taxonomy files, an E2P2
Pathologic output file, and five enzyme annotation output files as inputs.

- Pathway library files (they should exist in the ./input/ folder):
  1. Ubiquitous Plant Pathways (UPP.txt): Used to automatically approve any
     pathways on the list that are predicted for any land plant species.
  2. Common Viridiplantae Pathways (CVP.txt): A subset of the UPP that
     functions in the same manner but can be used for more basal plants and
     algae.
  3. Non-Plant Pathways (NPP.txt): Contains all curated non-plant pathways.
  4. Accept If Predicted Pathways (AIPP.txt): All AIPPs predicted by
     Pathologic for a species are automatically accepted, but no unpredicted
     pathways are added to databases from the AIPP.
  5. Conditionally Accepted Plant Pathways (CAPP.txt): Used to determine
     which pathways predicted for a given species should be kept, based on
     reaction and/or expected taxonomic range criteria.
  6. Manually Check Pathways (Manual.txt): This set of pathways is reserved
     for manual curation in each species regardless of whether they were
     predicted.
  7. Problem Pathways (problem_pathways.txt): This set of pathways has been
     manually confirmed as rejected.

- Taxonomy files (they should exist in the ./input/ folder):
  1. Taxonomic names from NCBI (names.dmp).
  2. Taxonomic tree structure from NCBI (nodes.dmp).

- An E2P2 Pathologic output file (pathologic.pf). The file should exist in
  the input folder given as the first argument.

- E2P2 enzyme annotation output files for a species. The files should exist
  in the input folder given as the first argument:
  1. pathway annotation output file (pathways.dat)
  2. reaction annotation output file (reactions.dat)
  3. species annotation output file (species.dat)
  4. protein annotation output file (proteins.dat)
  5. pathway annotation output file for previous database
     (pathways_pgdb.dat, optional)

-----------------------------------
Applying the SAVI results to a PGDB
-----------------------------------

SAVI produces lists of pathways it suggests should be added to and removed
from the PGDB, but it does not make those changes itself. You will therefore
have to make the changes manually, and the easiest way to do that is with the
Pathway Tools lisp interface.

SAVI produces two important output files, which it places in the output
folder you gave as the second argument. These files are ic.txt and
remove.txt, and they contain the lists of pathways to add and remove,
respectively. First, modify these files (or produce modified copies) by
adding some lisp syntax at the start and end. ic.txt should be modified to
look like:

(setq add '(
PWY-8281
PWY-3391
...
PWY-1234
))

while remove.txt should be modified to look like:

(setq remove '(
PWY-1098
PWY-982
...
PWY-243
))

These will be used to get the two lists of pathways into Pathway Tools and
assign them to variables called "add" and "remove".

cd to the folder containing the text files and run Pathway Tools in lisp mode:

/path/to/pathway-tools/pathway-tools -lisp

When you get a prompt like EC(0): then enter:

(so 'my_org_id)
(load "ic.txt")
(load "remove.txt")

Replace my_org_id with the organism ID (the cyc name minus "cyc" at the end).
This is what you entered in the "Organism/Project ID" field when creating the
PGDB. If you aren't sure what it is, you can get a list of them with:

(pprint (all-orgids))

Note: The unmatched single quote at the beginning of 'my_org_id is correct
syntax; it's just how lisp denotes a symbol name.

The above load commands load the lists into the variables add and remove.
Now you can delete the pathways listed in remove.txt with:

(loop for p in remove do (delete-frame-and-dependents p))

If that worked with no errors, you can now import the ic.txt pathways from
MetaCyc with:

(import-pathways add (find-org 'meta) (current-kb))

If either of the above commands fails with an error about a pathway not being
found, a nonexistent pathway has probably made it into one of the lists (most
likely it existed in a previous version of MetaCyc but has since been
removed). In this case, undo the partially-made changes with (revert-kb),
remove the pathway named in the error from whichever of the .txt files
contains it, reissue the above load commands, and try again.

If there were no errors, save the changes to the current database with:

(save-kb)

If you're going to do anything with the flat files, you should regenerate
them using the gui (see the Pathway Tools user manual). You can open the gui
from the lisp console with:

(pt)

When you exit the gui you will be returned to the lisp console.
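
The wrapping of ic.txt and remove.txt described earlier in this section can
be scripted instead of done by hand. The following is a minimal sketch, not
part of SAVI: the function name wrap_as_lisp_list and the output names
ic.lisp and remove.lisp are hypothetical, and it assumes each file holds one
pathway entry per line.

```python
from pathlib import Path


def wrap_as_lisp_list(src, dest, variable):
    """Copy src to dest wrapped as (setq VARIABLE '( ... )).

    Hypothetical helper, not part of SAVI. Assumes one pathway entry per
    line in src; blank lines are dropped. Returns the number of entries.
    """
    entries = [
        line.strip()
        for line in Path(src).read_text().splitlines()
        if line.strip()
    ]
    body = "\n".join(entries)
    # Same layout as the hand-edited examples: opening form, entries, "))".
    Path(dest).write_text(f"(setq {variable} '(\n{body}\n))\n")
    return len(entries)
```

Calling wrap_as_lisp_list("ic.txt", "ic.lisp", "add") and
wrap_as_lisp_list("remove.txt", "remove.lisp", "remove") leaves the SAVI
output untouched and writes wrapped copies. If you use copies like these, the
load commands take the copy names, for example (load "ic.lisp").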
You can exit the lisp console with:

(exit)

-------------------------
Building SAVI from source
-------------------------

SAVI is provided as already-compiled Java bytecode (savi-3.1.jar), so
compiling from source is not necessary for general use. The source code is
included, however, and if you do decide to modify it, you can recompile with:

./build-savi.sh

-------------------------------
Examples of running the program
-------------------------------

This shows the console output when the program is run:

> ./runSAVI3.1.sh input/arabidopsis output/arabidopsis
Last update: 02/06/2020 (SAVI 3.1) by Charles Hawkins. Original implemented by Taehyong Kim
Input folder name: input/arabidopsis
Output folder name: output/arabidopsis
Total Number of uppDB (UppParser): 229
Total Number of cvpDB (CvpParser): 181
Total Number of nppDB (NppParser): 238
Total Number of aippDB (AippParser): 168
Total Number of cappDB (CappParser): 632
Total Number of manualDB (ManualParser): 5
Total Number of problemPathwaysDB (ProblemPathwaysParser): 3
Total Number of Taxanomy (TaxanomyParser): 1004613
Total Number of taxanomyName (TaxonomyNameParser): 1003847
Total Number of Pathway (PathwayParser): in input/arabidopsis/pathways.dat: 636
input/arabidopsis/pathways_pgdb.dat is not found!!
(The above error means that the reports will not include comparisons to the previous version of the PGDB.
If you don't care about that or this is a new PGDB, you can safely ignore it)
Total Number of Reations (ReationsParser) : 3789
Total Number of Species (MetacycSpeciesDatParser): 1
Total Number of Pathologic mapping data (PathologicInputParser): 11041
Taxon ID is 3702 (Arabidopsis thaliana)
Species is within Embryophyta, using UPP.txt
Total Number of lines in result file (output/arabidopsis/Accepted.txt): 538
{CAPP=6.0, problemPathways=test-th-01, UPP=8.0, AIPP=6.0, MANUAL=test-9, CVP=6.0, NPP=9.0}
Total Number of lines in result file (output/arabidopsis/Accepted.txt): 538
Total Number of lines in result file (output/arabidopsis/Rejected.txt): 47
Total Number of lines in result file (output/arabidopsis/Manual-to-validate.txt): 57
Self validation was passed!: All pathways in Pathway.dat exist and exist only once across result files.
Self validation was passed!: All UPP pathways exist in Accepted.txt.
PWY-3385 exists in both UPP(CVP) and CAPP files)
PWY-6466 exists in both UPP(CVP) and CAPP files)
PWY3O-450 exists in both UPP(CVP) and CAPP files)
PWY-5669 exists in both UPP(CVP) and CAPP files)
PWY-5723 exists in both CAPP and manual files)

Last update: 05/20/2014 (SAVI validation 1.03)
Input folder name: input/arabidopsis
Output folder name: output/arabidopsis
Total Number of line (FirstTokenParser) in output/arabidopsis/Accepted.txt: 538
Total Number of line (FirstTokenParser) in output/arabidopsis/Rejected.txt: 47
Total Number of line (FirstTokenParser) in output/arabidopsis/Manual-to-validate.txt: 57
Total Number of Pathway (PathwayParser): in input/arabidopsis/pathways.dat: 636
input/arabidopsis/pathways_pgdb.dat is not found!!

Last updates: 08/15/2014 (SAVI PGDB input file generation 1.02)
Output folder name: output/arabidopsis
Number of pathways in aipp.txt: 123
Number of pathways in ic.txt: 3
Number of pathways in comp-upp.txt: 226
Number of pathways in comp-cvp.txt: 0
Number of pathways in comp-super.txt: 20
Number of pathways in comp-rxn-warning.txt: 3
Number of pathways in comp-taxon.txt: 12
Number of pathways in comp-rxn-taxon.txt: 22
Number of pathways in comp-rxn.txt: 128
Number of pathways in remove.txt: 46

-------------------------
Examples of result files
-------------------------

After the program runs successfully, the following output files are generated
in the output folder (for example ./output/arabidopsis/):

  1. Accepted pathway list file (Accepted.txt)
  2. Rejected pathway list file (Rejected.txt)
  3. Pathways that need to be manually validated (Manual-to-validate.txt)
  4. List of all pathways in pathways.dat (inputPathwayList.txt)
  5. Pathway information in the previous database (PreviousPGDB.txt)
  6. Summary of input files and results (validationReport.txt)
  7. Comparison between the input pathways and the pathways in the previous
     database (optional, compareResultWithPGDB.txt)
  8. Additional information for generating the PGDB (optional,
     comp-rxn-taxon.txt, comp-rxn-warning.txt, ic.txt, comp-rxn.txt,
     comp-super.txt, remove.txt, aipp.txt, comp-taxon.txt,
     all_pathways_previous_pgdb.txt, comp-upp.txt)

---------------
Troubleshooting
---------------

- I ran SAVI but my PGDB is completely unchanged

  SAVI makes recommendations for changes to the PGDB but does not make those
  changes itself. See the above section "Applying the SAVI results to a PGDB"
  for instructions on how to apply them.

- SAVI says it can't find pathways_pgdb.dat

  The pathways_pgdb.dat file is used in making comparisons to the previous
  version of the PGDB. It is only used in generating reports, and does not
  affect curation decisions.
  If you do want these comparisons to be included in the reports, copy the
  pathways.dat file of the previous PGDB version to the input folder as
  pathways_pgdb.dat. If this is a new PGDB and has no previous version, you
  can ignore this entirely.

- SAVI cannot find the taxon ID for my species - it's blank or NIL

  Sometimes Pathway Tools does not correctly export the taxon information for
  a PGDB when generating the flat files. What SAVI is looking for is a line
  in species.dat that looks like:

  NCBI-TAXONOMY-ID - 3702

  where 3702 is the taxon ID of your species from NCBI's taxonomy database.
  If the line is missing or lacks an ID, you can look the ID up in NCBI
  Taxonomy by going to https://www.ncbi.nlm.nih.gov/taxonomy/ and searching
  for your species. When you've found it in the search results, click through
  to the taxonomy page and look for a line near the top that looks like
  "Taxonomy ID: 3702 (for references in articles please use NCBI:txid3702)".
  That 3702 is what you're looking for. Copy it to the NCBI-TAXONOMY-ID line
  in species.dat and try running SAVI again.

- My species taxon ID is correct but SAVI still says it cannot find it in
  nodes.dmp

  The nodes.dmp file is obtained from NCBI Taxonomy and contains its complete
  taxonomic tree. SAVI uses it to determine whether the species is in
  Embryophyta, and therefore whether to curate using UPP.txt or CVP.txt, and
  also for taxonomy-dependent curation rules in CAPP.txt. SAVI does not
  update this file automatically, so you will need to do so yourself if your
  species was added to NCBI Taxonomy after SAVI's copy was last updated. To
  obtain an updated copy of nodes.dmp (along with names.dmp, which contains
  the "proper" names for each taxon ID), go to
  ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ and download taxdump.tar.gz. This
  archive contains up-to-date versions of both files.
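
  The two taxon ID checks above can be done before running SAVI. The
  following is a rough sketch of an independent check, not SAVI's own parser:
  the function names are hypothetical, it assumes the NCBI-TAXONOMY-ID line
  has the form shown above, and it relies on the standard NCBI taxdump layout
  in which each nodes.dmp line starts with the taxon ID followed by a "|"
  delimiter.

```python
import re

# Matches a line such as "NCBI-TAXONOMY-ID - 3702" (numeric IDs only).
TAXON_LINE = re.compile(r"^NCBI-TAXONOMY-ID\s+-\s+(\d+)\s*$")


def read_taxon_id(species_dat_path):
    """Return the taxon ID from species.dat as a string, or None if the
    line is missing or has no numeric ID (for example NIL)."""
    with open(species_dat_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = TAXON_LINE.match(line)
            if match:
                return match.group(1)
    return None


def taxon_in_nodes(taxon_id, nodes_dmp_path):
    """Return True if taxon_id appears as the first field of any line in
    nodes.dmp (fields are separated by "|", padded with tabs)."""
    wanted = str(taxon_id)
    with open(nodes_dmp_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.split("|", 1)[0].strip() == wanted:
                return True
    return False
```

  If read_taxon_id returns None, fix species.dat as described above; if it
  returns an ID but taxon_in_nodes returns False for ./input/nodes.dmp, the
  taxonomy files most likely need updating.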

- SAVI fails with out-of-memory errors (may happen in particular after
  updating nodes.dmp and names.dmp)

  SAVI is written in Java, and the maximum memory the Java virtual machine
  may use is set on the command line. This is done in runSAVI3.1.sh /
  runSAVI3.1.bat. Look for three lines that contain a component that looks
  like -Xmx1024m. That is the maximum memory setting (1024 MB, or 1 GB). You
  can edit the appropriate runSAVI file to change all three to something
  higher, such as -Xmx2048m or more.

-----------------------------
Contents of files in folders
-----------------------------

./ (batch files for running the program)
    README  runSAVI3.1.bat  runSAVI3.1.sh  savi-3.1.jar  build-savi.sh

./input (Common input files for running the program)
    AIPP.txt  CAPP.txt  CVP.txt  Manual.txt  NPP.txt  UPP.txt
    problem_pathways.txt  names.dmp  nodes.dmp

./input/arabidopsis (species specific input files for running the program)
    pathologic.pf  pathways.dat  pathways_pgdb.dat  proteins.dat
    reactions.dat  species.dat

./output/arabidopsis (output examples after running the program)
    Accepted.txt  Manual-to-validate.txt  PreviousPGDB.txt  Rejected.txt
    aipp.txt  all_pathways_previous_pgdb.txt  comp-rxn-taxon.txt
    comp-rxn-warning.txt  comp-rxn.txt  comp-super.txt  comp-taxon.txt
    comp-upp.txt  compareResultWithPGDB.txt  ic.txt  inputPathwayList.txt
    remove.txt  validationReport.txt

./sourceCode/ (Java source code for SAVI)
    carnegie

sourceCode//carnegie:
    bioinfo

sourceCode//carnegie/bioinfo:
    common  metacyc  savi

sourceCode//carnegie/bioinfo/common:
    ConstantsForCycDatParser.java  ConstantsForGeneralPurpose.java  parser
    ConstantsForEvidenceCode.java  VersionParameters.java  util

sourceCode//carnegie/bioinfo/common/parser:
    AbstractParser.java  IParser.java

sourceCode//carnegie/bioinfo/common/util:
    Logger.java  MathUtil.java  PrintUtil.java  RandDemo.java

sourceCode//carnegie/bioinfo/metacyc:
    parser

sourceCode//carnegie/bioinfo/metacyc/parser:
    MetacycPathwayDatParser.java  MetacycSpeciesDatParser.java  fileDB
    MetacycReactionDatParser.java
    PathologicInputParser.java

sourceCode//carnegie/bioinfo/metacyc/parser/fileDB:
    EnzrxnsDB.java  PathwayDB.java  ReactionDB.java  PathologicDB.java
    ProteinDB.java  SpeciesDB.java

sourceCode//carnegie/bioinfo/savi:
    SaviMain.java  filegeneration  input  validation

sourceCode//carnegie/bioinfo/savi/filegeneration:
    AcceptedListDB.java  FirstTokenParser.java  AcceptedListParser.java
    SaviPgdbInputGenerationMain.java

sourceCode//carnegie/bioinfo/savi/input:
    parser

sourceCode//carnegie/bioinfo/savi/input/parser:
    AippParser.java  NppParser.java  TaxonomyParser.java  CappParser.java
    ProblemPathwaysParser.java  UppParser.java  CvpParser.java
    ResultFileParser.java  fileDB  ManualParser.java  TaxonomyNameParser.java

sourceCode//carnegie/bioinfo/savi/input/parser/fileDB:
    AbCommonDB.java  CappDB.java  ProblemPathwaysDB.java  AbUppCvpDB.java
    CvpDB.java  TaxonomyDB.java  AippDB.java  ManualDB.java  UppDB.java

sourceCode//carnegie/bioinfo/savi/validation:
    FirstTokenParser.java  SaviValidationMain.java

---------
Changelog
---------

SAVI 3.1 (Feb 6, 2020):
- Many updates to the README.txt file:
  - Started keeping this change log
  - Added instructions for applying the changes to a PGDB
  - Added a troubleshooting section
  - Corrected outdated information (all of the references to ./SAVI3.02/
    were changed to ./)
  - Corrected some grammatical errors
- Updated UPP, CAPP, AIPP, and NPP with new curation for PMN 14
- SAVI no longer runs incorrectly if the input and output folders are
  supplied without a trailing slash
- SAVI now outputs the species taxon ID and name, and whether it will curate
  using UPP.txt or CVP.txt
- SAVI now outputs a warning if the taxon ID was not found in nodes.dmp,
  indicating either that the taxon ID is wrong or that nodes.dmp needs to be
  updated
- If pathways_pgdb.dat is not found, an explanation is printed, since this
  error can usually be ignored
- Added a script, build-savi.sh, to (re)build SAVI from source
- Updated the included arabidopsis example input data to be from the current version of
Aracyc (17.0) before we ran SAVI on it. The previous example data was an old version of the Carica papaya PGDB incorrectly labelled as arabidopsis.