Format Checking

ekherman edited this page Jun 16, 2020 · 9 revisions

Checking the format of a SNP panel file

The check_format tool checks the file format of one or more input files. An input directory may be specified with --input-dir (default: the current working directory), and one or more files in that directory may be listed with --file-list.

usage: snp_conversion check_format [-h] [--input-dir INPUT_DIR]
                                   [--file-list FILE_LIST]
                                   [--input-format {TOP,FWD,AB,LONG,DESIGN,PLUS,AFFY,AFFY-PLUS,mixed}]
                                   [--get-snp-panel] [--conversion CONVERSION]
                                   --assembly ASSEMBLY --species SPECIES [-v]
                                   [-s] [--tabular] [--plink]

Required options: --assembly, --species

Running check_format

There are two ways to run check_format:

  1. Specify an input format with the --input-format option. Use this approach when:
  • all input files are in the same format
  • a file is not correctly formatted, but you believe it to be in the specified format and want a list of the positions that are incorrect for that format
  2. Omit the --input-format option, or specify --input-format mixed. Use this approach when:
  • you are not sure of the format of your file(s)
  • you are checking multiple files in different formats
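
The two modes differ only in whether --input-format is passed. The sketch below assembles the argument list for each mode using only the flags shown in the usage text above. The helper name build_check_format_command, the file names, and the SPECIES/ASSEMBLY placeholders are hypothetical, and it assumes the tool is invoked as `snp_conversion check_format`, as the usage line suggests. It only builds and prints the command; it does not run it.

```python
import shlex

# Formats listed in the usage text above.
KNOWN_FORMATS = {"TOP", "FWD", "AB", "LONG", "DESIGN", "PLUS",
                 "AFFY", "AFFY-PLUS", "mixed"}


def build_check_format_command(species, assembly, input_format=None,
                               input_dir=None, files=None):
    """Return the argument list for a check_format run (hypothetical helper)."""
    if input_format is not None and input_format not in KNOWN_FORMATS:
        raise ValueError(f"unknown input format: {input_format}")
    cmd = ["snp_conversion", "check_format"]
    if input_dir:
        cmd += ["--input-dir", input_dir]
    if files:
        # --file-list takes a comma-separated list of file names.
        cmd += ["--file-list", ",".join(files)]
    if input_format:
        # Leaving the flag out lets the tool fall back to 'mixed'.
        cmd += ["--input-format", input_format]
    cmd += ["--assembly", assembly, "--species", species]
    return cmd


# Placeholder species/assembly names; the conversion_list tool lists real ones.
fixed = build_check_format_command("SPECIES", "ASSEMBLY", input_format="TOP",
                                   files=["panel_a.txt", "panel_b.txt"])
mixed = build_check_format_command("SPECIES", "ASSEMBLY", input_dir="panels")

print(shlex.join(fixed))  # mode 1: format stated explicitly
print(shlex.join(mixed))  # mode 2: format left for the tool to predict
```

The resulting list can be passed to subprocess.run once the placeholders are replaced with real species and assembly names.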

Options

  --input-dir INPUT_DIR
                        Directory containing input file(s) (default directory:
                        current working directory)
  --file-list FILE_LIST
                        [Optional] Comma-separated list of input files in the
                        input directory
  --input-format {TOP,FWD,AB,LONG,DESIGN,PLUS,AFFY,AFFY-PLUS,mixed}
                        Type of file(s) expected: 'TOP', 'FWD', 'AB', 'LONG',
                        'DESIGN', 'PLUS', 'AFFY', AFFY-PLUS, or (default)
                        'mixed'
  --get-snp-panel       [Optional] Display the selected genotype conversion
                        key file
  --conversion CONVERSION
                        Directory containing genotype conversion key files
                        (default directory: variant_position_files)
  --assembly ASSEMBLY   Assembly name (use conversion_list tool for all
                        available choices)
  --species SPECIES     Species name (use conversion_list tool for all
                        available choices)
  -v, --verbose-logging
                        [Optional] Write progress messages to an output
                        *-[timestamp].log file
  -s, --summary         Summarize converted SNP file in *_summary.txt file
  --tabular             Output summary file in tabular format (default: False)
  --plink               Creates PLINK flat files (PED and MAP) (default:
                        False)

Special options

Incorrect SNP threshold

By default, at least 95% of the markers in an input file must be correct for the program to predict its format. To change this threshold, edit the variable minimum_correct_snp_fraction at the top of lib/check_format.py.

Input files

For information on input files, see the Input Files page.

Genotype Conversion Files

The check_format tool requires genotype conversion file information to be specified with the --conversion, --species, and --assembly options. See Genotype Conversion Files for information.

Program Output

When the check_format utility is run on a file, the main output is one of the following statements:

  • "File [filename] is correctly formatted in [format] format"
  • "File [filename] may be in [format] format with [x] inconsistent SNPs"
  • "File type for [filename] could not be determined: too many SNPs with inconsistent formatting"

The file format cannot be determined when the fraction of correct SNPs in the file falls below the minimum correct SNP fraction (default: 0.95), as described above.
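
To make the relationship between the threshold and the three messages concrete, here is a minimal sketch of the decision. It is an illustration based on the description above, not the code in lib/check_format.py. The function name describe_result and its parameters are hypothetical, and it assumes a file that exactly meets the threshold still receives a prediction.

```python
MINIMUM_CORRECT_SNP_FRACTION = 0.95  # default value described above


def describe_result(filename, fmt, n_correct, n_total,
                    threshold=MINIMUM_CORRECT_SNP_FRACTION):
    """Pick one of the three status messages (illustrative logic only)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    inconsistent = n_total - n_correct
    if inconsistent == 0:
        # Every SNP matches the candidate format.
        return f"File {filename} is correctly formatted in {fmt} format"
    if n_correct / n_total >= threshold:
        # Enough SNPs match to predict a format, but some do not.
        return (f"File {filename} may be in {fmt} format "
                f"with {inconsistent} inconsistent SNPs")
    # Too few SNPs match any format for a prediction.
    return (f"File type for {filename} could not be determined: "
            "too many SNPs with inconsistent formatting")


print(describe_result("panel.txt", "TOP", 1000, 1000))
print(describe_result("panel.txt", "TOP", 960, 1000))
print(describe_result("panel.txt", "TOP", 900, 1000))
```

With the default threshold, a file of 1000 SNPs with 40 inconsistent positions (96% correct) still gets a format prediction, while one with 100 inconsistent positions (90% correct) does not.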

If the file type is specified with the --input-format option and is not "mixed", inconsistent positions are written to an output file called [file_basename]-[timestamp].log. This file contains a table with three columns: Sample Name, User Input, and Conversion Key.

Additional Output Files

Summary files and PLINK flat files (PED and MAP) can be generated by running check_format with the -s/--summary and --plink options, respectively. For more information on these files, see Additional Output Files.