Format Checking
The check_format tool checks the file format of one or more input files. Files are read from the directory given with --input-dir (default: the current working directory), and specific files in that directory can be selected with --file-list.
usage: snp_conversion check_format [-h] [--input-dir INPUT_DIR]
[--file-list FILE_LIST]
[--input-format {TOP,FWD,AB,LONG,DESIGN,PLUS,AFFY,AFFY-PLUS,mixed}]
[--get-snp-panel] [--conversion CONVERSION]
--assembly ASSEMBLY --species SPECIES [-v]
[-s] [--tabular] [--plink]
Required options: --assembly, --species
There are two ways to run check_format:
- Specifying an input format with the --input-format option. Use this option when:
- all input file(s) are of the same format
- a file is not correctly formatted, but you believe it to be in the specified format and wish to see a list of incorrect positions, given the specified format
- Not specifying an input format, either by omitting the --input-format option or by specifying --input-format mixed. Use this option when:
- you are not sure of the format of your file(s)
- you are checking multiple files in different formats
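The two modes differ only in whether --input-format is passed. The sketch below assembles the argument list for each mode; the helper build_check_format_args is hypothetical (not part of the tool), the assembly, species, and file names are placeholders, and the executable name is taken from the usage line above, so the way it is launched on your system may differ.

```python
def build_check_format_args(assembly, species, input_format=None,
                            input_dir=None, file_list=None):
    """Hypothetical helper: build an argument list for check_format.

    Only options shown in the usage text above are used.
    """
    # --assembly and --species are the two required options.
    args = ["snp_conversion", "check_format",
            "--assembly", assembly, "--species", species]
    # Omitting --input-format is equivalent to requesting 'mixed' (the default).
    if input_format is not None:
        args += ["--input-format", input_format]
    if input_dir is not None:
        args += ["--input-dir", input_dir]
    # --file-list expects a comma-separated list of file names.
    if file_list:
        args += ["--file-list", ",".join(file_list)]
    return args


# Mode 1: all files are believed to be in one format (placeholder values).
known_format = build_check_format_args(
    "ASSEMBLY_NAME", "SPECIES_NAME",
    input_format="TOP", file_list=["a.txt", "b.txt"])

# Mode 2: format unknown or mixed, so --input-format is left out.
unknown_format = build_check_format_args("ASSEMBLY_NAME", "SPECIES_NAME")

print(" ".join(known_format))
print(" ".join(unknown_format))
```

The resulting lists can be passed to subprocess.run once the placeholders are replaced with values reported by the conversion_list tool.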
--input-dir INPUT_DIR
Directory containing input file(s) (default directory:
current working directory)
--file-list FILE_LIST
[Optional] Comma-separated list of input files in the
input directory
--input-format {TOP,FWD,AB,LONG,DESIGN,PLUS,AFFY,AFFY-PLUS,mixed}
Type of file(s) expected: 'TOP', 'FWD', 'AB', 'LONG',
'DESIGN', 'PLUS', 'AFFY', AFFY-PLUS, or (default)
'mixed'
--get-snp-panel [Optional] Display the selected genotype conversion
key file
--conversion CONVERSION
Directory containing genotype conversion key files
(default directory: variant_position_files)
--assembly ASSEMBLY Assembly name (use conversion_list tool for all
available choices)
--species SPECIES Species name (use conversion_list tool for all
available choices)
-v, --verbose-logging
[Optional] Write progress messages to an output
*-[timestamp].log file
-s, --summary Summarize converted SNP file in *_summary.txt file
--tabular Output summary file in tabular format (default: False)
--plink Creates PLINK flat files (PED and MAP) (default:
False)
By default, 95% of the markers in an input file must be correct for the program to predict its format. This value can be changed by editing the variable minimum_correct_snp_fraction at the top of lib/check_format.py.
For information on input files, see the Input Files page.
The check_format tool requires genotype conversion file information to be specified with the --conversion, --species, and --assembly options. See Genotype Conversion Files for information.
When the check_format utility is run on a file, the main output is one of the following statements:
- "File [filename] is correctly formatted in [format] format"
- "File [filename] may be in [format] format with [x] inconsistent SNPs"
- "File type for [filename] could not be determined: too many SNPs with inconsistent formatting"
Failure to determine the file format occurs when the fraction of correct SNPs is lower than the minimum correct SNP fraction (default: 0.95) described above.
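The relationship between the threshold and the three statements can be sketched as follows. This is an illustration of the behavior described here, not the code in lib/check_format.py: the function describe_result and its arguments are hypothetical, and the handling of a file that sits exactly at the threshold or contains no markers is an assumption.

```python
# Default threshold, mirroring the documented value of
# minimum_correct_snp_fraction (0.95).
MINIMUM_CORRECT_SNP_FRACTION = 0.95


def describe_result(filename, fmt, n_correct, n_total,
                    min_fraction=MINIMUM_CORRECT_SNP_FRACTION):
    """Hypothetical helper: pick one of the three documented statements."""
    # Assumption: an empty file, or one below the threshold, gives no prediction.
    if n_total == 0 or n_correct / n_total < min_fraction:
        return (f"File type for {filename} could not be determined: "
                "too many SNPs with inconsistent formatting")
    n_inconsistent = n_total - n_correct
    if n_inconsistent == 0:
        return f"File {filename} is correctly formatted in {fmt} format"
    return (f"File {filename} may be in {fmt} format "
            f"with {n_inconsistent} inconsistent SNPs")


print(describe_result("a.txt", "TOP", 100, 100))
print(describe_result("b.txt", "TOP", 97, 100))
print(describe_result("c.txt", "TOP", 90, 100))
```

With the default threshold, a file with 97 of 100 markers correct still yields a format prediction, while a file with 90 of 100 does not.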
If the file type is specified using the parameter --input-format and is not "mixed", inconsistent positions are printed to an output file called [file_basename]-[timestamp].log. This file contains a table with the structure:
| Sample | Name | User Input | Conversion Key |
|---|---|---|---|
Summary files and PLINK flat files (PED and MAP) can be generated using check_format with the -s/--summary and --plink options, respectively. For more information on these files, see Additional Output Files.