------------------------------------------------------------------------------
Command Line for Running SAVI (Semi-Automated Validation Infrastructure) 3.1
------------------------------------------------------------------------------

SAVI is a tool created and maintained by the Plant Metabolic Network (https://plantcyc.org) for use in our pipeline for creating pathway genome databases (PGDBs). SAVI is used to apply our previous curation decisions about metabolic pathways to new species databases and to updated versions of existing databases. It relies on previous manual curation by the curators at PMN, and is intended to be used for species that are within Viridiplantae (land plants and green algae). It will not produce correct or useful results for species outside of that clade.

Two arguments are required for this program. 
The first argument is the input folder of SAVI, for example ./input/arabidopsis/
The second argument is the output folder of SAVI, for example ./output/arabidopsis/

Example:  
	runSAVI3.1.sh ./input/arabidopsis/ ./output/arabidopsis/ (Linux, Mac)
	runSAVI3.1.bat ./input/arabidopsis/ ./output/arabidopsis/ (Windows)

The program needs seven pathway library files, two taxonomic files, an E2P2 Pathologic output file, and five enzyme annotation output files as inputs.

- Pathway library files (They should exist in the ./input/ folder):
1. Ubiquitous Plant Pathways (UPP.txt): This is used to automatically approve any pathways on the list that are predicted for any land plant species.
2. Common Viridiplantae Pathways (CVP.txt): This is a subset of the UPP that functions in the same manner but can be used for more basal plants and algae.
3. Non-Plant Pathways (NPP.txt): This file contains all curated non-plant pathways.
4. Accept If Predicted Pathways (AIPP.txt): All AIPPs predicted by Pathologic for any species are automatically accepted, but no unpredicted pathways will be added to databases from the AIPP.
5. Conditionally Accepted Plant Pathways (CAPP.txt): This will be used to determine which pathways predicted for a given species should be kept based on reaction and/or expected taxonomic range criteria. 
6. Manually Check Pathways (Manual.txt): This set of pathways is reserved for manual curation in each species regardless of whether they were predicted.
7. Problem Pathways (problem_pathways.txt): These pathways have been manually confirmed as rejected.

- Taxonomy files (They should exist in the ./input/ folder):
1. Taxonomic names from NCBI (names.dmp).
2. Taxonomic tree structure from NCBI (nodes.dmp).

- An E2P2 Pathologic output file (pathologic.pf) (The file should exist in the input folder given as the first argument)

- E2P2 enzyme annotation output files for a species (The files should exist in the input folder given as the first argument):
1. pathway annotation output file (pathways.dat)
2. reaction annotation output file (reactions.dat)
3. species annotation output file (species.dat)
4. protein annotation output file (proteins.dat)
5. pathway annotation output file for previous database (pathways_pgdb.dat, optional)
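
Before a run, it can help to confirm that all of the files listed above are in place. The following is a minimal sketch of such a check, not something shipped with SAVI: the function name check_savi_inputs is made up for illustration, and it takes the shared input folder and the species input folder as its two arguments. The optional pathways_pgdb.dat is deliberately not checked.

```shell
# check_savi_inputs COMMON_DIR SPECIES_DIR
# Prints each required input file that is missing and returns non-zero
# if any are. Hypothetical helper, not part of SAVI.
check_savi_inputs() {
    common_dir=$1
    species_dir=$2
    missing=0
    # Pathway library and taxonomy files shared by all species
    for f in UPP.txt CVP.txt NPP.txt AIPP.txt CAPP.txt Manual.txt \
             problem_pathways.txt names.dmp nodes.dmp; do
        if [ ! -f "$common_dir/$f" ]; then
            echo "missing: $common_dir/$f"
            missing=1
        fi
    done
    # Species-specific files (pathways_pgdb.dat is optional and not checked)
    for f in pathologic.pf pathways.dat reactions.dat species.dat proteins.dat; do
        if [ ! -f "$species_dir/$f" ]; then
            echo "missing: $species_dir/$f"
            missing=1
        fi
    done
    return $missing
}
```

For example, to run SAVI only if nothing is missing:

	check_savi_inputs ./input ./input/arabidopsis && ./runSAVI3.1.sh ./input/arabidopsis/ ./output/arabidopsis/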


-----------------------------------
Applying the SAVI results to a PGDB
-----------------------------------
SAVI produces lists of pathways it suggests should be added to and removed from the PGDB, but it does not make those changes itself. You will therefore have to make the changes manually and the easiest way to do that is using the Pathway Tools lisp interface.

SAVI produces two important output files, which it will place in the output folder you gave as the second argument. These files are ic.txt and remove.txt, and they contain the lists of pathways to add and remove, respectively. First, you want to modify these files (or produce modified copies) with some lisp syntax at the start and end. ic.txt should be modified to look like:

	(setq add '(
	PWY-8281
	PWY-3391
	...
	PWY-1234
	))

while remove.txt should be modified to look like:

	(setq remove '(
	PWY-1098
	PWY-982
	...
	PWY-243
	))
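
The wrapping shown above can also be done from the shell instead of by hand. This is only a sketch under stated assumptions: the function name wrap_pathway_list is hypothetical, and it assumes ic.txt and remove.txt contain one pathway ID per line and nothing else. It writes a modified copy rather than editing the SAVI output in place, so the input and output paths must differ.

```shell
# wrap_pathway_list VARNAME INFILE OUTFILE
# Writes OUTFILE containing (setq VARNAME '( ...IDs... )), with the
# pathway IDs copied line by line from INFILE. Blank lines are skipped.
# Hypothetical helper, not part of SAVI. INFILE and OUTFILE must differ.
wrap_pathway_list() {
    var=$1
    infile=$2
    outfile=$3
    {
        printf "(setq %s '(\n" "$var"
        # Strip carriage returns and blank lines; grep exits 1 when the
        # list is empty, which is not an error here
        tr -d '\r' < "$infile" | grep -v '^[[:space:]]*$' || true
        printf "))\n"
    } > "$outfile"
}
```

For example, using a separate folder (here called lisp/, an arbitrary name) so that the copies keep their original file names:

	mkdir -p lisp
	wrap_pathway_list add ./output/arabidopsis/ic.txt lisp/ic.txt
	wrap_pathway_list remove ./output/arabidopsis/remove.txt lisp/remove.txt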

These will be used to get the two lists of pathways into Pathway Tools and assign them to variables called "add" and "remove". cd to the folder containing the text files and run Pathway Tools in lisp mode:

	/path/to/pathway-tools/pathway-tools -lisp

When you get a prompt like EC(0): then enter:
 
	(so 'my_org_id)
	(load "ic.txt")
	(load "remove.txt")

Replace my_org_id with the organism ID (the cyc name minus "cyc" at the end). This is what you entered in the "Organism/Project ID" field when creating the PGDB. If you aren't sure what it is, you can get a list of them with:

	(pprint (all-orgids))

Note: The unpaired single quote at the beginning of 'my_org_id is correct syntax; it is how lisp denotes a symbol name.

The above load commands will load the lists into the variables add and remove. Now you can delete the pathways listed in remove.txt with:

	(loop for p in remove do (delete-frame-and-dependents p))

If that worked with no errors, you can now import the ic.txt pathways from MetaCyc with:

	(import-pathways add (find-org 'meta) (current-kb))

If either of the above commands errors out complaining about a pathway not being found, it probably means a nonexistent pathway got into one of the lists (most likely it existed in a previous version of MetaCyc but has since been removed). In this case, undo the partially made changes with (revert-kb), remove the pathway named in the error from whichever of the .txt files it is in, reissue the above load commands, and try again. If there were no errors, save the changes to the current database with:
 
	(save-kb)
 
If you're going to do anything with the flat files, you should regenerate them using the GUI (see the Pathway Tools user manual). You can open the GUI from the lisp console with:

	(pt)

When you exit the GUI you will be returned to the lisp console. You can exit the lisp console with:
 
	(exit)
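
If the delete or import step fails because of a nonexistent pathway, as described above, the offending ID has to be taken out of ic.txt or remove.txt before retrying. A minimal sketch of doing that from the shell follows. The function name drop_pathway is hypothetical, and it assumes each pathway ID sits on its own line, as in the examples above.

```shell
# drop_pathway PATHWAY_ID FILE
# Removes every line of FILE that consists solely of PATHWAY_ID
# (ignoring surrounding whitespace), keeping a backup in FILE.bak.
# Hypothetical helper, not part of SAVI or Pathway Tools.
drop_pathway() {
    id=$1
    file=$2
    cp "$file" "$file.bak"
    awk -v id="$id" '{
        line = $0
        gsub(/^[ \t\r]+|[ \t\r]+$/, "", line)   # trim before comparing
        if (line != id) print                    # keep all other lines as-is
    }' "$file.bak" > "$file"
}
```

For example, if the error names PWY-1234 and it came from ic.txt:

	drop_pathway PWY-1234 ic.txt

Because whole lines are compared, an ID such as PWY-12345 is left alone.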

-------------------------
Building SAVI from source
-------------------------
SAVI is provided as precompiled Java bytecode (savi-3.1.jar), so compiling from source is not necessary for general use. The source code is included, however, and if you do decide to modify it, you can recompile with:

	./build-savi.sh

-------------------------------
Examples of running the program
-------------------------------
The following shows the console output when the program is run:
> ./runSAVI3.1.sh input/arabidopsis output/arabidopsis
Last update: 02/06/2020 (SAVI 3.1) by Charles Hawkins. Original implemented by Taehyong Kim
Input folder name: input/arabidopsis
Output folder name: output/arabidopsis
Total Number of uppDB (UppParser): 229
Total Number of cvpDB (CvpParser): 181
Total Number of nppDB (NppParser): 238
Total Number of aippDB (AippParser): 168
Total Number of cappDB (CappParser): 632
Total Number of manualDB (ManualParser): 5
Total Number of problemPathwaysDB (ProblemPathwaysParser): 3
Total Number of Taxanomy (TaxanomyParser): 1004613
Total Number of taxanomyName (TaxonomyNameParser): 1003847
Total Number of Pathway (PathwayParser): in input/arabidopsis/pathways.dat: 636
input/arabidopsis/pathways_pgdb.dat is not found!!
(The above error means that the reports will not include comparisons to the previous version of the PGDB. If you don't care about that or this is a new PGDB, you can safely ignore it)
Total Number of Reations (ReationsParser) : 3789
Total Number of Species (MetacycSpeciesDatParser): 1
Total Number of Pathologic mapping data (PathologicInputParser): 11041
Taxon ID is 3702 (Arabidopsis thaliana)
Species is within Embryophyta, using UPP.txt
Total Number of lines in result file (output/arabidopsis/Accepted.txt): 538
{CAPP=6.0, problemPathways=test-th-01, UPP=8.0, AIPP=6.0, MANUAL=test-9, CVP=6.0, NPP=9.0}
Total Number of lines in result file (output/arabidopsis/Accepted.txt): 538
Total Number of lines in result file (output/arabidopsis/Rejected.txt): 47
Total Number of lines in result file (output/arabidopsis/Manual-to-validate.txt): 57
Self validation was passed!: All pathways in Pathway.dat exist and exist only once across result files.
Self validation was passed!: All UPP pathways exist in Accepted.txt.
PWY-3385 exists in both UPP(CVP) and CAPP files)
PWY-6466 exists in both UPP(CVP) and CAPP files)
PWY3O-450 exists in both UPP(CVP) and CAPP files)
PWY-5669 exists in both UPP(CVP) and CAPP files)
PWY-5723 exists in both CAPP and manual files)
Last update: 05/20/2014 (SAVI validation 1.03)
Input folder name: input/arabidopsis
Output folder name: output/arabidopsis
Total Number of line (FirstTokenParser) in output/arabidopsis/Accepted.txt: 538
Total Number of line (FirstTokenParser) in output/arabidopsis/Rejected.txt: 47
Total Number of line (FirstTokenParser) in output/arabidopsis/Manual-to-validate.txt: 57
Total Number of Pathway (PathwayParser): in input/arabidopsis/pathways.dat: 636
input/arabidopsis/pathways_pgdb.dat is not found!!
Last updates: 08/15/2014 (SAVI PGDB input file generation 1.02)
Output folder name: output/arabidopsis
Number of pathways in aipp.txt: 123
Number of pathways in ic.txt: 3
Number of pathways in comp-upp.txt: 226
Number of pathways in comp-cvp.txt: 0
Number of pathways in comp-super.txt: 20
Number of pathways in comp-rxn-warning.txt: 3
Number of pathways in comp-taxon.txt: 12
Number of pathways in comp-rxn-taxon.txt: 22
Number of pathways in comp-rxn.txt: 128
Number of pathways in remove.txt: 46

-------------------------
Examples of result files
-------------------------
After the program runs successfully, the following output files will be generated in the output folder (for example ./output/arabidopsis/):
1. Accepted pathway list file (Accepted.txt)
2. Rejected pathway list file (Rejected.txt)
3. Pathways that need to be manually validated (Manual-to-validate.txt)
4. List of all pathways in pathways.dat (inputPathwayList.txt)
5. Pathway information in the previous database (PreviousPGDB.txt)
6. Summary of input files and results (validationReport.txt)
7. Pathway comparison between the pathways of input and the pathways in the previous database (optional, compareResultWithPGDB.txt)
8. Additional information for generating PGDB database (optional, comp-rxn-taxon.txt, comp-rxn-warning.txt, ic.txt, comp-rxn.txt, comp-super.txt, remove.txt, aipp.txt, comp-taxon.txt, all_pathways_previous_pgdb.txt, comp-upp.txt)

---------------
Troubleshooting
---------------
- I ran SAVI but my PGDB is completely unchanged

SAVI makes recommendations for changes to the PGDB but does not make those changes itself. See the above section "Applying the SAVI results to a PGDB" for instructions on how to do this.

- SAVI says it can't find pathways_pgdb.dat

The pathways_pgdb.dat file is used in making comparisons to the previous version of the PGDB. This is only used in generating reports, and does not affect curation decisions. If you do want these to be included in the reports, copy the pathways.dat file of the previous PGDB version to the input folder as pathways_pgdb.dat. If this is a new PGDB and has no previous version, you can ignore this entirely.

- SAVI cannot find the taxon ID for my species - it's blank or NIL

Sometimes Pathway Tools does not correctly export the taxon information for a PGDB when generating the flat files. What SAVI is looking for is a line in species.dat that looks like:

NCBI-TAXONOMY-ID - 3702

where 3702 is the taxon ID of your species from NCBI's taxonomy database. If the line is omitted or lacks an ID, you can look it up using NCBI Taxonomy by going to https://www.ncbi.nlm.nih.gov/taxonomy/ and searching for your species. When you've found it in the search results, click to go to the taxonomy page and look for a line near the top that looks like "Taxonomy ID: 3702 (for references in articles please use NCBI:txid3702)". That 3702 is what you're looking for. Copy it to the NCBI-TAXONOMY-ID line in species.dat and try running SAVI again.
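
To check quickly whether species.dat carries a usable taxon ID, something like the following can be used. It is a sketch only: the function name taxon_id_from_species_dat is hypothetical, and it assumes the line has exactly the form shown above, with a plain numeric ID.

```shell
# taxon_id_from_species_dat FILE
# Prints the value of the first NCBI-TAXONOMY-ID line in FILE and
# returns non-zero if the line is missing or has no numeric ID.
# Hypothetical helper, not part of SAVI.
taxon_id_from_species_dat() {
    id=$(sed -n 's/^NCBI-TAXONOMY-ID - \([0-9][0-9]*\)[[:space:]]*$/\1/p' "$1" | head -n 1)
    [ -n "$id" ] && echo "$id"
}
```

For example:

	taxon_id_from_species_dat ./input/arabidopsis/species.dat || echo "no taxon ID in species.dat"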

- My species taxon ID is correct but SAVI still says it cannot find it in nodes.dmp

The nodes.dmp file is obtained from NCBI Taxonomy and contains the complete taxonomic tree from NCBI Taxonomy. SAVI uses it to determine whether the species is in Embryophyta and therefore whether to curate using UPP.txt or CVP.txt, and also for taxonomy-dependent curation rules in CAPP.txt. SAVI does not update this file automatically, so you will need to do so yourself if your species was added to NCBI Taxonomy subsequent to the last time SAVI's copy was updated. To obtain an updated copy of nodes.dmp (along with names.dmp which contains the "proper" names for each taxon ID), go to ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ and download taxdump.tar.gz. This archive will contain up-to-date versions of both files.
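
To see whether SAVI's copy of nodes.dmp already knows about a taxon ID before downloading anything, a check along these lines may help. The function name taxon_in_nodes is hypothetical. It relies on the NCBI dump layout, in which fields are separated by a tab, a vertical bar, and a tab, with the taxon ID as the first field.

```shell
# taxon_in_nodes TAXON_ID NODES_DMP
# Returns 0 if TAXON_ID appears as the first field of a line in
# NODES_DMP, non-zero otherwise. Hypothetical helper, not part of SAVI.
taxon_in_nodes() {
    # Splitting on tabs is enough to isolate the first field
    awk -F '\t' -v id="$1" '
        $1 == id { found = 1; exit }
        END { exit !found }
    ' "$2"
}
```

For example:

	taxon_in_nodes 3702 ./input/nodes.dmp || echo "3702 is not in nodes.dmp"

If the ID is missing, and assuming curl and tar are available, the archive can be fetched and the two files unpacked into ./input/ with something like:

	curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
	tar -xzf taxdump.tar.gz -C ./input names.dmp nodes.dmp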

- SAVI fails with out-of-memory errors (may happen in particular after updating nodes.dmp and names.dmp)

SAVI is written in Java, and the maximum amount of memory Java may use is set on the command line. This is done in runSAVI3.1.sh / runSAVI3.1.bat. Look for three lines that contain a component that looks like -Xmx1024m. That is the maximum memory setting (1024 MB, or 1 GB). You can edit the appropriate runSAVI file to change all three to something higher, such as -Xmx2048m.
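
On Linux or Mac, the edit can be made in one step instead of by hand. The following is a sketch, with a hypothetical function name (set_savi_heap), that assumes the memory options in the script have the -Xmx form described above. It keeps a backup of the original script. On Windows, edit runSAVI3.1.bat in a text editor instead.

```shell
# set_savi_heap SCRIPT NEW_SIZE
# Replaces every -Xmx<number>m or -Xmx<number>g option in SCRIPT with
# -Xmx<NEW_SIZE>, keeping the original as SCRIPT.bak.
# Hypothetical helper, not part of SAVI.
set_savi_heap() {
    script=$1
    size=$2
    cp "$script" "$script.bak"
    # Writing through a redirect keeps the script's existing permissions
    sed "s/-Xmx[0-9][0-9]*[mMgG]/-Xmx$size/g" "$script.bak" > "$script"
}
```

For example, to raise the limit to 2 GB:

	set_savi_heap ./runSAVI3.1.sh 2048m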

-----------------------------
Contents of files in folders
-----------------------------
./ (batch files for running the program)
README
runSAVI3.1.bat
runSAVI3.1.sh
savi-3.1.jar
build-savi.sh

./input (Common input files for running the program)
AIPP.txt
CAPP.txt
CVP.txt
Manual.txt
NPP.txt
UPP.txt
problem_pathways.txt
names.dmp
nodes.dmp

./input/arabidopsis (species specific input files for running the program)
pathologic.pf
pathways.dat
pathways_pgdb.dat
proteins.dat
reactions.dat
species.dat

./output/arabidopsis (output examples after running the program)
Accepted.txt
Manual-to-validate.txt
PreviousPGDB.txt
Rejected.txt
aipp.txt
all_pathways_previous_pgdb.txt
comp-rxn-taxon.txt
comp-rxn-warning.txt
comp-rxn.txt
comp-super.txt
comp-taxon.txt
comp-upp.txt
compareResultWithPGDB.txt
ic.txt
inputPathwayList.txt
remove.txt
validationReport.txt

./sourceCode/ (Java source code for SAVI)
carnegie

sourceCode//carnegie:
bioinfo

sourceCode//carnegie/bioinfo:
common	metacyc	savi

sourceCode//carnegie/bioinfo/common:
ConstantsForCycDatParser.java	ConstantsForGeneralPurpose.java	parser
ConstantsForEvidenceCode.java	VersionParameters.java		util

sourceCode//carnegie/bioinfo/common/parser:
AbstractParser.java	IParser.java

sourceCode//carnegie/bioinfo/common/util:
Logger.java	MathUtil.java	PrintUtil.java	RandDemo.java

sourceCode//carnegie/bioinfo/metacyc:
parser

sourceCode//carnegie/bioinfo/metacyc/parser:
MetacycPathwayDatParser.java	MetacycSpeciesDatParser.java	fileDB
MetacycReactionDatParser.java	PathologicInputParser.java

sourceCode//carnegie/bioinfo/metacyc/parser/fileDB:
EnzrxnsDB.java		PathwayDB.java		ReactionDB.java
PathologicDB.java	ProteinDB.java		SpeciesDB.java

sourceCode//carnegie/bioinfo/savi:
SaviMain.java	filegeneration	input		validation

sourceCode//carnegie/bioinfo/savi/filegeneration:
AcceptedListDB.java			FirstTokenParser.java
AcceptedListParser.java			SaviPgdbInputGenerationMain.java

sourceCode//carnegie/bioinfo/savi/input:
parser

sourceCode//carnegie/bioinfo/savi/input/parser:
AippParser.java			NppParser.java			TaxonomyParser.java
CappParser.java			ProblemPathwaysParser.java	UppParser.java
CvpParser.java			ResultFileParser.java		fileDB
ManualParser.java		TaxonomyNameParser.java

sourceCode//carnegie/bioinfo/savi/input/parser/fileDB:
AbCommonDB.java		CappDB.java		ProblemPathwaysDB.java
AbUppCvpDB.java		CvpDB.java		TaxonomyDB.java
AippDB.java		ManualDB.java		UppDB.java

sourceCode//carnegie/bioinfo/savi/validation:
FirstTokenParser.java	SaviValidationMain.java

---------
Changelog
---------

SAVI 3.1 (Feb 6, 2020):
- Many updates to the README.txt file:
	- Started keeping this change log
	- Added instructions for applying the changes to a PGDB
	- Added a troubleshooting section
	- Corrected outdated information (all of the references to ./SAVI3.02/ were changed to ./)
	- Corrected some grammatical errors
- Updated UPP, CAPP, AIPP, and NPP with new curation for PMN 14
- SAVI no longer runs incorrectly if the input and output folders are supplied without a trailing slash
- SAVI now outputs the species taxon ID and name, and whether it will curate using UPP.txt or CVP.txt
- SAVI now outputs a warning if the taxon ID was not found in nodes.dmp, indicating either that the taxon ID is wrong or that nodes.dmp needs to be updated
- If pathways_pgdb.dat is not found, an explanation is printed, since this error can usually be ignored
- Added a script, build-savi.sh, to (re)build SAVI from source
- Updated the included arabidopsis example input data to be from the current version of AraCyc (17.0) before we ran SAVI on it. The previous example data was an old version of the Carica papaya PGDB incorrectly labelled as arabidopsis.