orthomcl starts with protein sequences grouped by genome and generates
ortholog groups by building input for the mcl program from an
all-by-all blast of the sequences.
FastOrtho is a reimplementation of the orthomcl program that does not require
the use of databases or perl.
To create the FastOrtho executable, type "make" in the src directory.
FastOrtho has many input options (listed below). They are probably
most easily configured by using the included SetFast.jar GUI to create
an option file that can be given to FastOrtho as its only input.
(You may need to use the line
java -jar SetFast.jar
to start the GUI)
FastOrtho Command line Options (see below for sample use)
--option_file file_name
Used to read options from a file.
Expects at most one option per line in the file.
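As a rough illustration of the one-option-per-line layout, the Python
sketch below writes an option file. The helper name write_option_file and
the output file name are hypothetical and are not part of FastOrtho. It
assumes that a flag without a value (such as --legacy_blast) sits on a
line by itself.

```python
import os
import tempfile


def write_option_file(path, options):
    """Write one option per line; a value of None means a flag with no argument."""
    with open(path, "w") as handle:
        for name, value in options:
            line = name if value is None else f"{name} {value}"
            handle.write(line + "\n")


# Hypothetical settings, modeled on the sample option sets at the end of this file.
options = [
    ("--mixed_genome_fasta", "/home/mscott/fasta/samples.faa"),
    ("--working_directory", "/home/mscott/projects/samples"),
    ("--project_name", "version1"),
    ("--legacy_blast", None),
]

option_path = os.path.join(tempfile.gettempdir(), "fastortho_example.options")
write_option_file(option_path, options)
```

The resulting file would then be handed to FastOrtho through
--option_file as its only input.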
//////// OPERATIONAL INPUT PARAMETER copied from OrthoMCL //////
--pv_cutoff maximum_e_value
Used to discard blast hits with large e-values.
default = 1e-5
--pi_cutoff minimum_percent_identity
Used to discard blast hits with small percent identity values.
Percent identity for a query subject pair is based on
the weighted mean of all blast lines for the query subject pair.
The weight of a line is based on the length of its alignment section.
default = 0.0 (does not generate any discards)
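The length-weighted mean described above can be sketched in Python as
follows. This is an illustration only, not FastOrtho's code: the function
names are hypothetical, and comparing with ">=" against the cutoff is an
assumption about how the boundary case is handled.

```python
def weighted_percent_identity(lines):
    """lines: list of (percent_identity, alignment_length) tuples,
    one tuple per blast line of a single query subject pair."""
    total_length = sum(length for _, length in lines)
    if total_length == 0:
        return 0.0
    # Each line contributes in proportion to its alignment length.
    return sum(pi * length for pi, length in lines) / total_length


def passes_pi_cutoff(lines, pi_cutoff=0.0):
    # Assumption: a pair is kept when its weighted mean reaches the cutoff.
    return weighted_percent_identity(lines) >= pi_cutoff
```

For example, one line of 100% identity over 50 positions and another of
50% identity over 150 positions give (100*50 + 50*150) / 200 = 62.5.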
--pmatch_cutoff minimum_percent_matching
Used to discard blast hits in which too small a percent of the
protein sequences are involved in the blast alignments.
Percent applies to shorter of the query and subject sequences.
default = 0.0 (does not generate any discards)
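One possible reading of the percent matching value is sketched below in
Python. The text above does not say how overlapping alignments are
counted, so merging overlapping ranges before counting positions is an
assumption, and the function name is hypothetical.

```python
def percent_match(intervals, query_length, subject_length):
    """intervals: list of (start, end) aligned ranges, 1-based and inclusive,
    given in the coordinates of the shorter of the two sequences."""
    shorter = min(query_length, subject_length)
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # A gap was found: close the previous merged range, if any.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent range: extend the current one.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return 100.0 * covered / shorter
```

With this reading, ranges 1-50 and 40-100 on a sequence of length 200
cover 100 positions, giving 50 percent.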
--maximum_weight numeric_value
Weights for mcl are computed using -log10(e-value) from blast hits,
which has no meaning for an e-value of 0.0.
This value is used in place of -log10(0.0).
default = 316.0
316.0 is larger than -log10(x) for any x larger than 0
that can be stored as a normalized double precision floating point value.
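The substitution can be sketched in Python as below. The function name is
hypothetical, and the sketch covers only the -log10 step described here,
not any further processing FastOrtho may apply before writing mcl input.

```python
import math
import sys


def blast_weight(e_value, maximum_weight=316.0):
    """Return -log10(e_value), using maximum_weight when e_value is 0.0."""
    if e_value <= 0.0:
        # log10(0.0) is undefined, so the configured maximum is used instead.
        return maximum_weight
    return -math.log10(e_value)


# Smallest positive normalized double: its weight stays below the default maximum.
smallest_normal_weight = blast_weight(sys.float_info.min)
```

Here sys.float_info.min is roughly 2.2e-308, so its weight is close to
307.7, which is below the default of 316.0.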
--inflation numeric_value
Provides a value for the -I option to use when FastOrtho
calls the mcl program.
default = 1.5
--blast_cpus numeric_value
Only used when FastOrtho handles launching NCBI blast.
Provides a value for blastall -a option or the blastp -num_threads.
default = 1
--blast_b numeric_value
Only used when FastOrtho handles launching NCBI blast.
Provides a value for blastall -b option or the blastp -num_descriptions.
default = 1000
--blast_e numeric_value
Only used when FastOrtho handles launching NCBI blast.
Provides a value for blastall -e option or the blastp -evalue.
default = 1e-5
--blast_v numeric_value
Only used when FastOrtho handles launching NCBI blast.
Provides a value for blastall -v option or the blastp -num_alignments.
default = 1000
--only_fastas
Allows FastOrtho to be used to create a combined protein sequence
file for use by an NCBI blast called outside of FastOrtho.
It also results in a *.glg file being generated for later use
with the --gg_file option.
//////// FILE LOCATION SPECIFICATION //////
--single_genome_fasta file_path
Describes a protein sequence file to be used as input for
NCBI blast processing. All sequences in this file will be
considered members of a genome with a name derived from the
file_path value. Even if NCBI blast processing is not being
performed this option is useful when no appropriate input
is available for the --gg_file option.
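The exact rule for deriving the genome name from file_path is not stated
here. The sample option sets at the end of this file (organism_A.faa
giving the genome name organism_A) suggest the file name without its
directory and extension, which is the assumption behind this Python
sketch. The function name is hypothetical.

```python
from pathlib import PurePosixPath


def genome_name_from_path(file_path):
    """Assumed rule: drop the directory and the final extension."""
    return PurePosixPath(file_path).stem
```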
--mixed_genome_fasta file_path
Describes a protein sequence file to be used as input for
NCBI blast processing. The genome name to associate with
a protein sequence will be derived from the text enclosed by
[] at the end of the sequence's > name line.
Even if NCBI blast processing is not being
performed this option is useful when no appropriate input
is available for the --gg_file option.
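Reading the genome name from the trailing [] of a name line might look
like the Python sketch below. The function name is hypothetical, and
returning None for a line without a trailing [] is an assumption, since
FastOrtho's handling of that case is not described here.

```python
import re

# Matches the last [...] group when it closes the line (trailing spaces allowed).
_GENOME_TAG = re.compile(r"\[([^\[\]]*)\]\s*$")


def genome_name_from_header(header_line):
    """Return the text inside the trailing [] of a > name line, or None."""
    match = _GENOME_TAG.search(header_line)
    return match.group(1) if match else None
```

For a name line such as ">prot1 hypothetical protein [Escherichia coli]"
this returns "Escherichia coli".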
--blast_file file_path
Allows FastOrtho to use pre-computed NCBI blast output generated
using legacy blastall -m 8 or current blastp -outfmt 6 output format.
--bpo_file file_path
Allows FastOrtho to use *.bpo files that were generated by classic
orthomcl using NCBI blast output.
--gg_file file_path
Classic orthomcl generates *.gg files to detail the genomes
involved in a project and which gene names belong to which genomes.
FastOrtho generates similar *.glg files. FastOrtho needs this
membership information to operate. If no such files are available
FastOrtho can generate the information from a list of the
protein sequence files used to prepare the input for NCBI blast.
--working_directory directory_path
FastOrtho generates several files during its work flow and needs
a directory where it has permission to create these files.
--project_name file_prefix
All temporary files generated by FastOrtho will begin with
this value and be placed in the working_directory.
--formatdb_path file_path
Only used when FastOrtho is tasked with running NCBI blast.
Provides text for running makeblastdb executable or
legacy formatdb executable. Not required if executable
will run without a path specification.
--blastall_path file_path
Only used when FastOrtho is tasked with running NCBI blast.
Provides text for running blastp executable or
legacy blastall executable. Not required if executable
will run without a path specification.
--mcl_path file_path
Allows FastOrtho to apply its input to the mcl program.
Not required if a plain "mcl" command will execute from the command line.
--result_file file_path
Specifies where FastOrtho should store its final results.
//////// SUPPORT FOR BLAST HITS WITH NON-STANDARD COLUMN ARRANGEMENTS //////
--query_index numeric_value
FastOrtho expects blast hit data in column format. This value
specifies where to read the query name.
default = 0
if FastOrtho is running NCBI blast or using
--blast_file input.
default = 1
if FastOrtho is using --bpo_file input
This option allows the use of files that are similar to those
produced by NCBI blast but with different column placements
--subject_index numeric_value
see --query_index with defaults 1, 3 instead of 0, 1
--e_value_index numeric_value
see --query_index with defaults 10, 5 instead of 0, 1
--percent_idenity_index numeric_value
see --query_index with defaults 2, 6 instead of 0, 1
--alignment_length_index numeric_value
see --query_index with default 3 instead of 0
(does not apply to --bpo_file)
--query_start_index numeric_value
see --query_index with default 6 instead of 0
(does not apply to --bpo_file)
--query_end_index numeric_value
see --query_index with default 7 instead of 0
(does not apply to --bpo_file)
--query_length_index numeric_value
see --query_index with default x, 2 instead of 0, 1
(Only applies to --bpo_file)
--subject_start_index numeric_value
see --query_index with default 8 instead of 0
(does not apply to --bpo_file)
--subject_end_index numeric_value
see --query_index with default 9 instead of 0
(does not apply to --bpo_file)
--subject_length_index numeric_value
see --query_index with default x, 4 instead of 0, 1
(Only applies to --bpo_file)
--mapping_index numeric_value
see --query_index with default x, 7 instead of 0, 1
(Only applies to --bpo_file)
--split_char single_character
Specifies the character used to separate columns in the blast hit file.
default = tab
if FastOrtho is running NCBI blast or using
--blast_file input.
default = ;
if FastOrtho is using --bpo_file input
--use_tab_split
Equivalent to --split_char single_character where
single_character = tab
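The default column positions listed above can be collected into two
lookup tables, one for --blast_file style input and one for --bpo_file
input. The Python sketch below is only an illustration of how the index
options select columns. The names BLAST_DEFAULTS, BPO_DEFAULTS and
read_hit are hypothetical, and the sample lines in the comments are
made up.

```python
# Defaults when FastOrtho runs NCBI blast or reads --blast_file input.
BLAST_DEFAULTS = {
    "split_char": "\t",
    "query_index": 0,
    "subject_index": 1,
    "percent_idenity_index": 2,
    "alignment_length_index": 3,
    "query_start_index": 6,
    "query_end_index": 7,
    "subject_start_index": 8,
    "subject_end_index": 9,
    "e_value_index": 10,
}

# Defaults when reading --bpo_file input.
BPO_DEFAULTS = {
    "split_char": ";",
    "query_index": 1,
    "query_length_index": 2,
    "subject_index": 3,
    "subject_length_index": 4,
    "e_value_index": 5,
    "percent_idenity_index": 6,
    "mapping_index": 7,
}


def read_hit(line, settings):
    """Split one hit line and pick out the columns named by the settings."""
    columns = line.rstrip("\n").split(settings["split_char"])
    return {
        "query": columns[settings["query_index"]],
        "subject": columns[settings["subject_index"]],
        "e_value": float(columns[settings["e_value_index"]]),
        "percent_identity": float(columns[settings["percent_idenity_index"]]),
    }


# Made-up example lines in the two layouts.
blast_line = "geneA\tgeneB\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t350"
bpo_line = "1;geneA;210;geneB;220;1e-100;98;1:1-200:5-204."
blast_hit = read_hit(blast_line, BLAST_DEFAULTS)
bpo_hit = read_hit(bpo_line, BPO_DEFAULTS)
```

A file with a different column order would be handled by overriding the
relevant entries, which is what the index options do for FastOrtho.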
/////////// SPECIAL FLAGS ///////////////
--match_OrthoMcl
Ensures that FastOrtho uses the exact logic of classic orthomcl.
In classic orthomcl discarding a paralog blast hit because of
low percent identity will block all subsequent paralog hits in
the same query block even if they pass all of the other blast
hit filtering. This did not seem reasonable and is not the
default behavior of FastOrtho.
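The difference between the two behaviors can be sketched in Python as
below. This is a simplified illustration of the description above, not
code taken from FastOrtho or orthomcl: the function name is hypothetical
and only the percent identity filter is modeled.

```python
def keep_paralog_hits(hits, pi_cutoff, match_orthomcl=False):
    """hits: list of (subject_name, percent_identity) tuples for one
    query block, in the order they appear in the blast output."""
    kept = []
    for subject, percent_identity in hits:
        if percent_identity < pi_cutoff:
            if match_orthomcl:
                # Classic behavior: later paralog hits in the block are blocked.
                break
            # FastOrtho default: only this hit is discarded.
            continue
        kept.append(subject)
    return kept


hits = [("p1", 90.0), ("p2", 10.0), ("p3", 95.0)]
default_result = keep_paralog_hits(hits, pi_cutoff=50.0)
classic_result = keep_paralog_hits(hits, pi_cutoff=50.0, match_orthomcl=True)
```

In this example the default keeps p1 and p3, while the classic logic
keeps only p1 because the rejected p2 blocks p3.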
--legacy_blast
Only used when FastOrtho handles launching NCBI blast.
Tells FastOrtho to use formatdb & blastall instead of the
defaults makeblastdb & blastp. When using legacy NCBI blast
this option needs to be included even if --formatdb_path
and --blastall_path have been specified since the legacy programs
use different strings for specifying option values.
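The flag correspondence between the legacy and current programs, as
listed under the --blast_cpus, --blast_b, --blast_e and --blast_v options,
can be summarized as a lookup table. The Python sketch below only restates
that table. The names BLAST_FLAGS and blast_arguments are hypothetical,
and the rest of the command line (program, database, query, output
format) is left out because the way FastOrtho assembles it is not
described here.

```python
BLAST_FLAGS = {
    # FastOrtho option: (legacy blastall flag, blastp flag)
    "--blast_cpus": ("-a", "-num_threads"),
    "--blast_b": ("-b", "-num_descriptions"),
    "--blast_e": ("-e", "-evalue"),
    "--blast_v": ("-v", "-num_alignments"),
}


def blast_arguments(settings, legacy_blast=False):
    """Translate FastOrtho blast settings into flag/value pairs."""
    column = 0 if legacy_blast else 1
    arguments = []
    for option, flags in BLAST_FLAGS.items():
        if option in settings:
            arguments += [flags[column], str(settings[option])]
    return arguments


settings = {"--blast_cpus": 4, "--blast_e": "1e-5"}
current_arguments = blast_arguments(settings)
legacy_arguments = blast_arguments(settings, legacy_blast=True)
```

The two programs spell the same setting differently, which is why
--legacy_blast is needed even when the executable paths are given.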
/////////////////////////////// SAMPLE EXAMPLES OF TEXT LINES FOR --option_file
//// Smallest option set
/// Assumes the $PATH environment variable will provide the locations of
/// mcl and the NCBI programs makeblastdb and blastp
/// final result will be found in /home/mscott/projects/samples/version1.end
--mixed_genome_fasta /home/mscott/fasta/samples.faa
--working_directory /home/mscott/projects/samples
--project_name version1
//// Smallest option set where a blast file has been provided
/// samples.faa specification is required to link proteins to their genome
/// final result will be found in /home/mscott/projects/samples/version2.end
--mixed_genome_fasta /home/mscott/fasta/samples.faa
--blast_file /home/mscott/projects/samples/version1.out
--working_directory /home/mscott/projects/samples
--project_name version2
/// short option set using --single_genome_fasta instead of --mixed_genome_fasta
// final result will be found in /home/mscott/projects/samples/version3.end
// genome names will consist of organism_A, organism_B, and organism_C
--single_genome_fasta /home/mscott/fasta/organism_A.faa
--single_genome_fasta /home/mscott/fasta/organism_B.faa
--single_genome_fasta /home/mscott/fasta/organism_C.faa
--working_directory /home/mscott/projects/samples
--project_name version3