Given a ComputedFile, is there a way to know what processor job generated it? #1868

Open
arielsvn opened this issue Nov 6, 2019 · 3 comments

Comments

@arielsvn
Contributor

arielsvn commented Nov 6, 2019

Context

Samples can have many ComputedFiles, each of which belongs to a ComputationalResult (the result_id column on computed_files).

Processor jobs create instances of ComputationalResult and ComputedFile when samples are processed.

Problem or idea

We have some samples with multiple computed files, but for each file it's not obvious which original files were used to generate it. There's also no way to know which processor job generated it.

For example, sample GSM248431 has multiple computed files.

data_refinery=> select * from computed_files where id in (select computed_file_id from sample_computed_file_associations where sample_id in (select id from samples where accession_code='GSM248431'));
   id    |         filename         |                          absolute_file_path                          | size_in_bytes |                   sha1                   | is_smashable | is_qc | is_qn_target |           s3_bucket            |                      s3_key                       | is_public |          created_at           |         last_modified         | result_id | compendia_organism_id | compendia_version | is_compendia | quant_sf_only | svd_algorithm 
---------+--------------------------+----------------------------------------------------------------------+---------------+------------------------------------------+--------------+-------+--------------+--------------------------------+---------------------------------------------------+-----------+-------------------------------+-------------------------------+-----------+-----------------------+-------------------+--------------+---------------+---------------
 1043270 | GSM248431_GE1002_2_2.PCL | /home/user/data_store/processor_job_1615745/GSM248431_GE1002_2_2.PCL |        143873 | 619f543f7b5ed39180e0e71a69f37b9ad6e11bd8 | t            | f     | f            | data-refinery-s3-circleci-prod | y3wchb8nev4iyz4c6s9ukv0w_GSM248431_GE1002_2_2.PCL | t         | 2018-12-20 14:44:10.890171+00 | 2018-12-20 14:44:11.139879+00 |    927601 |                       |                   | f            | f             | NONE
 1043658 | GSM248431_GE1001_3.PCL   | /home/user/data_store/processor_job_1616335/GSM248431_GE1001_3.PCL   |        143950 | 2688b7951ec72ab49540fcfb4da44d2683715c68 | t            | f     | f            | data-refinery-s3-circleci-prod | kk01w6jrhdp9o0f31mt1pnje_GSM248431_GE1001_3.PCL   | t         | 2018-12-20 14:47:08.301983+00 | 2018-12-20 14:47:12.167675+00 |    927989 |                       |                   | f            | f             | NONE
 1043633 | GSM248431_GE1003_3.PCL   | /home/user/data_store/processor_job_1616313/GSM248431_GE1003_3.PCL   |        143963 | 925810e5b65bdb8e10079e52da93c0c588833589 | t            | f     | f            | data-refinery-s3-circleci-prod | puu08rbp3hoq9aspikktocmd_GSM248431_GE1003_3.PCL   | t         | 2018-12-20 14:47:01.282186+00 | 2018-12-20 14:47:01.845971+00 |    927964 |                       |                   | f            | f             | NONE
 1043207 | GSM248431_GE1004_3.PCL   | /home/user/data_store/processor_job_1615679/GSM248431_GE1004_3.PCL   |        143930 | 4cf20fad148a0632a02df608b3246db9f0992797 | t            | f     | f            | data-refinery-s3-circleci-prod | cxcqnjnuorgarw3kx6www1i0_GSM248431_GE1004_3.PCL   | t         | 2018-12-20 14:43:32.654072+00 | 2018-12-20 14:43:46.704009+00 |    927538 |                       |                   | f            | f             | NONE
 1043592 | GSM248431_GE1002_2_1.PCL | /home/user/data_store/processor_job_1616233/GSM248431_GE1002_2_1.PCL |        143841 | 7fe7f2702e1dca767345906d81a33a07b9031484 | t            | f     | f            | data-refinery-s3-circleci-prod | rcou7jcua1kb9kp2a7ryqzu7_GSM248431_GE1002_2_1.PCL | t         | 2018-12-20 14:46:34.442139+00 | 2018-12-20 14:46:36.158067+00 |    927923 |                       |                   | f            | f             | NONE
 1043205 | GSM248431_GE1003_1.PCL   | /home/user/data_store/processor_job_1615657/GSM248431_GE1003_1.PCL   |        143799 | ad9a3ce7580aca4acb284c531587d8b4ecc25f7a | t            | f     | f            | data-refinery-s3-circleci-prod | 5xnifq0jqtfkxpb9k2mds8ux_GSM248431_GE1003_1.PCL   | t         | 2018-12-20 14:43:32.579096+00 | 2018-12-20 14:43:46.693395+00 |    927536 |                       |                   | f            | f             | NONE
 1043449 | GSM248431_GE1002_1.PCL   | /home/user/data_store/processor_job_1615986/GSM248431_GE1002_1.PCL   |        143796 | 68314b156905068ef1649ed78bd5f7ed14f5ed14 | t            | f     | f            | data-refinery-s3-circleci-prod | ffsdc752f6gyfh6dn31ibow9_GSM248431_GE1002_1.PCL   | t         | 2018-12-20 14:45:31.363883+00 | 2018-12-20 14:45:40.448563+00 |    927780 |                       |                   | f            | f             | NONE
 1044554 | GSM248431_GE1004_1.PCL   | /home/user/data_store/processor_job_1617375/GSM248431_GE1004_1.PCL   |        143816 | 3e72fc3a4f1b4818bc212abc443758511c183b6c | t            | f     | f            | data-refinery-s3-circleci-prod | e1b6si5cgpi02r9n0nyjuwry_GSM248431_GE1004_1.PCL   | t         | 2018-12-20 15:03:08.943553+00 | 2018-12-20 15:03:21.513281+00 |    928887 |                       |                   | f            | f             | NONE
 1043195 | GSM248431_GE1002_3.PCL   | /home/user/data_store/processor_job_1615673/GSM248431_GE1002_3.PCL   |        143916 | 0df30bf5d76da1a2af1a5d4b43381f220a784fc6 | t            | f     | f            | data-refinery-s3-circleci-prod | bm4x03ivvrxt4cd7dzjfsasu_GSM248431_GE1002_3.PCL   | t         | 2018-12-20 14:43:26.870718+00 | 2018-12-20 14:43:42.702274+00 |    927526 |                       |                   | f            | f             | NONE
(9 rows)

And multiple original files:

data_refinery=> select id, filename, is_archive, source_filename from original_files where id in (select original_file_id from original_file_sample_associations where sample_id in (select id from samples where accession_code='GSM248431'));
   id    |         filename         | is_archive |       source_filename       
---------+--------------------------+------------+-----------------------------
 1482345 | GSM248431_GE1002_3.CEL   | f          | GSM248431_GE1002_3.CEL.gz
 1419569 |                          | t          | GSM248431_GE1002_2_1.CEL.gz
 1419302 |                          | t          | GSM248431_GE1002_1.CEL.gz
 1482325 | GSM248431_GE1003_1.CEL   | f          | GSM248431_GE1003_1.CEL.gz
 1482352 | GSM248431_GE1004_3.CEL   | f          | GSM248431_GE1004_3.CEL.gz
 1420599 |                          | t          | GSM248431_GE1003_3.CEL.gz
 1483092 | GSM248431_GE1003_3.CEL   | f          | GSM248431_GE1003_3.CEL.gz
 1483111 | GSM248431_GE1001_3.CEL   | f          | GSM248431_GE1001_3.CEL.gz
 1419866 |                          | t          | GSM248431_GE1002_2_2.CEL.gz
 1484300 | GSM248431_GE1004_1.CEL   | f          | GSM248431_GE1004_1.CEL.gz
 1420090 |                          | t          | GSM248431_GE1002_3.CEL.gz
 1419042 |                          | t          | GSM248431_GE1001_3.CEL.gz
 1483015 | GSM248431_GE1002_2_1.CEL | f          | GSM248431_GE1002_2_1.CEL.gz
 1482723 | GSM248431_GE1002_1.CEL   | f          | GSM248431_GE1002_1.CEL.gz
 1420814 |                          | t          | GSM248431_GE1004_1.CEL.gz
 1482431 | GSM248431_GE1002_2_2.CEL | f          | GSM248431_GE1002_2_2.CEL.gz
 1421034 |                          | t          | GSM248431_GE1004_3.CEL.gz
 1420360 |                          | t          | GSM248431_GE1003_1.CEL.gz
(18 rows)
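
As a stopgap, the job behind each file can apparently be read off the absolute_file_path column in the first query, since every path contains a processor_job_<id> directory. Here is a minimal sketch, assuming that directory name really is the ProcessorJob id and that a computed file shares its filename stem with the original file it came from. Both are inferences from the output above, not confirmed behavior, and the helper names are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumption: the work directory is named after the ProcessorJob id.
JOB_DIR = re.compile(r"processor_job_(\d+)")


def job_id_from_path(absolute_file_path):
    """Return the job id embedded in the path, or None if there is none."""
    match = JOB_DIR.search(absolute_file_path)
    return int(match.group(1)) if match else None


def match_original(computed_filename, original_filenames):
    """Return the original files whose stem equals the computed file's stem."""
    stem = PurePosixPath(computed_filename).stem
    return [name for name in original_filenames
            if PurePosixPath(name).stem == stem]


# Values copied from the query output above.
path = "/home/user/data_store/processor_job_1615745/GSM248431_GE1002_2_2.PCL"
originals = ["GSM248431_GE1002_2_2.CEL", "GSM248431_GE1002_2_1.CEL"]

print(job_id_from_path(path))
print(match_original("GSM248431_GE1002_2_2.PCL", originals))
```

This only works while the path convention holds, which is why an explicit relation is still worth adding.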

Solution or next step

I think it makes sense to add a new relation between ComputationalResult and ProcessorJob.

Tagging @kurtwheeler for further discussion.

@kurtwheeler
Contributor

I think you're right! I think we should be able to tell what ProcessorJob generated a ComputationalResult. A ComputationalResult will never have more than one ProcessorJob associated with it, so it should just be a processor_job_id property on the ComputationalResult model, one that we'll probably want to not expose via the API? (I think at the moment we aren't exposing anything about jobs via the API.)
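
To make the shape of that change concrete, here is a rough sketch using an in-memory SQLite database rather than the real Django models. Only computed_files and its result_id column come from the output above. The computational_results and processor_jobs table names, the nullable processor_job_id column, and the processor_job_for helper are assumptions for illustration, and pairing result 927601 with job 1615745 is inferred from the file path.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table and column names other than computed_files.result_id are assumed.
    CREATE TABLE processor_jobs (id INTEGER PRIMARY KEY);
    CREATE TABLE computational_results (
        id INTEGER PRIMARY KEY,
        -- Proposed column: nullable so that existing rows remain valid.
        processor_job_id INTEGER REFERENCES processor_jobs(id)
    );
    CREATE TABLE computed_files (
        id INTEGER PRIMARY KEY,
        filename TEXT,
        result_id INTEGER REFERENCES computational_results(id)
    );
    INSERT INTO processor_jobs VALUES (1615745);
    INSERT INTO computational_results VALUES (927601, 1615745);
    INSERT INTO computed_files
        VALUES (1043270, 'GSM248431_GE1002_2_2.PCL', 927601);
""")


def processor_job_for(conn, computed_file_id):
    """Return the id of the job that produced a computed file, or None."""
    row = conn.execute(
        """
        SELECT cr.processor_job_id
        FROM computed_files cf
        JOIN computational_results cr ON cr.id = cf.result_id
        WHERE cf.id = ?
        """,
        (computed_file_id,),
    ).fetchone()
    return row[0] if row else None


print(processor_job_for(conn, 1043270))
```

Because a result never has more than one job, a single foreign key column is enough and no association table is needed.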

@arielsvn
Contributor Author

arielsvn commented Nov 8, 2019

> one that we'll probably want to not expose via the API? (I think at the moment we aren't exposing anything about jobs via the API.)

Actually, we have endpoints that expose all the jobs: /jobs/downloader and /jobs/processor. I started using them to list the jobs for each sample in AlexsLemonade/refinebio-frontend#784. Is there any reason not to expose processor_job_id in the API? It would be nice to be able to inspect the jobs associated with a ComputationalResult via the API.

@kurtwheeler
Contributor

Nope, no reason at all! I just thought we were trying to hide those deets from our users but honestly I was hoping that we'd eventually change that anyway :D
