Cannot specify columns in the dataset #1

Open
mightyphil2000 opened this issue Nov 30, 2020 · 2 comments
Comments

@mightyphil2000
Contributor

I'm getting this error message when I try to specify columns in the dataset:

x$determine_columns(list(chr_col="CHR", snp_col="rs", pos_col="BP", oa_col="other_allele", ea_col="eff_allele", eaf_col="Fctrl", beta_col="lnor", se_col="SE", pval_col="P",ncase_col = "ncases",ncontrol_col="nctrls","imp_info_col"="RsqAvg"))

Error in x$determine_columns(list(chr_col = "CHR", snp_col = "rs", pos_col = "BP", :
all(is.numeric(out$beta)) is not TRUE
In addition: Warning message:
Unknown or uninitialised column: beta.

I've checked that the beta column is definitely all numeric.
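
A note on the warning above: `Unknown or uninitialised column` is what a tibble emits when `$` is used on a column that does not exist. In that case `$` returns NULL, and `is.numeric(NULL)` is FALSE, so the check could fail even when the beta values in the file are numeric. A minimal sketch, assuming `out` is a tibble that lacks a `beta` column (the toy data is made up for illustration, and this is not the package's actual code):

```r
library(tibble)

# Toy stand-in for the formatted dataset; note there is no `beta` column.
out <- tibble(se = c(0.1, 0.2), pval = c(0.34, 0.97))

# `$` on a missing tibble column warns and returns NULL.
b <- suppressWarnings(out$beta)

is.null(b)          # TRUE
all(is.numeric(b))  # FALSE, which is what stopifnot() would trip on
```

If that is what is happening, the thing to check would be whether the name given as beta_col actually matches a column in the file as read.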

If I specify the columns using column position I get a different error:

x$determine_columns(list(chr_col=3, snp_col=2, pos_col=4, oa_col=6, ea_col=5, eaf_col=12, beta_col=16, se_col=8, pval_col=9))
Error in .subset2(x, i, exact = exact) : subscript out of bounds

@mvab

mvab commented Feb 9, 2021

Hi,

I have the same issue, but for the pval column: all(is.numeric(out$pval)) is not TRUE

This is how I specify the columns:

x$determine_columns(list(chr_col="CHR", 
                         snp_col="SNP", 
                         pos_col="BP",
                         oa_col="ALLELE0",
                         ea_col="ALLELE1", 
                         eaf_col="A1FREQ", 
                         beta_col="BETA", 
                         se_col="SE",
                         pval_col="P_BOLT_LMM_INF"))

Here it seems to be assigning the columns correctly:

Checking alleles are in A/C/T/G/D/I
0 variants with disallowed characters
Is this how the dataset should look?
tibble [100 × 9] (S3: tbl_df/tbl/data.frame)
 $ chr : int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
 $ pos : int [1:100] 10177 10352 11008 11012 13110 13116 13118 13273 14464 14599 ...
 $ ea  : chr [1:100] "A" "T" "C" "C" ...
 $ oa  : chr [1:100] "AC" "TA" "G" "G" ...
 $ beta: num [1:100] 0.003867 -0.000167 -0.003125 -0.003125 -0.001727 ...
 $ se  : num [1:100] 0.00408 0.00419 0.00701 0.00701 0.00929 ...
 $ pval: num [1:100] 0.34 0.97 0.66 0.66 0.85 0.79 0.79 0.12 0.79 0.95 ...
 $ snp : chr [1:100] "rs367896724" "rs201106462" "rs575272151" "rs544419019" ...
 $ eaf : num [1:100] 0.602 0.607 0.914 0.914 0.941 ...
NULL

I think something is going wrong with the column order in the format function. In my file, the column order is not the same as the order of the input arguments in determine_columns (understandably), so I specify the columns by name (as above). This leads to the error.
However, if I reorder the columns in my original file to match the order of the arguments in the format_dataset function, save it as a new file, and then run format_dataset on that file, it works fine.

Column order in the original file:
"CHR", "BP", "SNP", "BETA", "SE", "ALLELE1", "ALLELE0", "A1FREQ", "P_BOLT_LMM_INF"
Reordered:
"CHR", "SNP", "BP", "ALLELE0", "ALLELE1", "A1FREQ", "BETA", "SE", "P_BOLT_LMM_INF"

For both files I run the same x$determine_columns call as above.

So reordering the file before trying to upload is a workaround for now.

@mvab

mvab commented Feb 22, 2021

Hi @explodecomputer,

I think I found what is causing this issue (ignore my above investigation).

In determine_columns(), files in the IEU GWAS pipeline output format are read correctly when rows=100 is specified (example 1).
However, when rows=Inf (inside the format_dataset() function), the pval column is read as <chr>, not as <dbl> (example 2). I'm not sure why this happens.

(example 1) $ P_BOLT_LMM <dbl> 0.400, 0.940, 0.740, 0.740, 0.790, 0.960,

(example 2) $ P_BOLT_LMM <chr> "4.0E-01", "9.4E-01", "7.4E-01", "7.4E-01"

So the is.numeric() check fails.
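
To make the failure concrete, here is a small base R illustration using the values from the two examples above. The vector names are made up; only the values come from the output shown.

```r
# Same p-values, once as read with rows=Inf (character) and once with rows=100 (double).
p_chr <- c("4.0E-01", "9.4E-01", "7.4E-01", "7.4E-01")
p_dbl <- c(0.400, 0.940, 0.740, 0.740)

check_chr <- all(is.numeric(p_chr))  # FALSE: a character vector is not numeric
check_dbl <- all(is.numeric(p_dbl))  # TRUE
```

Note that is.numeric() tests the type of the whole vector, not each element, so strings that look like numbers still fail it.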

My suggestions:

  • set pval column data type to numeric manually, here: pval=as.numeric(a[[params$pval_col]])

or

  • use vroom::vroom to read in data instead of fread - vroom recognises the values in the format "7.4E-01" as numeric (plus vroom is a bit faster than fread)
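
For the first suggestion, a rough base R sketch of what the manual conversion could look like. `to_numeric_strict` is a hypothetical helper, not something in the package; it adds a guard so that values which genuinely cannot be parsed raise an error instead of silently becoming NA.

```r
# Hypothetical helper (not part of the package): coerce a column to numeric,
# but stop if any non-missing value fails to parse.
to_numeric_strict <- function(x, name = "column") {
  if (is.numeric(x)) {
    return(x)
  }
  out <- suppressWarnings(as.numeric(x))
  bad <- is.na(out) & !is.na(x)  # values lost in the conversion
  if (any(bad)) {
    stop(sprintf("%s: %d value(s) could not be parsed as numeric, e.g. '%s'",
                 name, sum(bad), x[bad][1]))
  }
  out
}

# Scientific-notation strings like those in example 2 convert cleanly.
pval <- to_numeric_strict(c("4.0E-01", "9.4E-01", "7.4E-01"), "pval")
is.numeric(pval)  # TRUE
```

In the package this would presumably wrap the pval assignment, something like pval=to_numeric_strict(a[[params$pval_col]], "pval"), but I haven't tested that against the real code.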
