long time ... 1.9.1
mbaudis committed Sep 12, 2024
1 parent 49afff1 commit 51c4ebc
Showing 59 changed files with 5,465 additions and 560 deletions.
70 changes: 32 additions & 38 deletions docs/generated/argument_definitions.md
@@ -95,13 +95,13 @@ cohort ids
- `type`: `string`
**cmdFlags:** `--filters`
**description:**
prefixed filter values, comma concatenated
prefixed filter values, comma concatenated; or objects in POST

### `filter_precision`
**type:** string
**cmdFlags:** `--filterPrecision`
**description:**
`either` start or `exact` (`exact being internal default`) for matching filter values
either `start` or `exact` for matching filter values
**default:** `exact`

### `filter_logic`
@@ -136,7 +136,7 @@ chromosome

### `mate_name`
**type:** string
**db_key:** location.sequence_id
**db_key:** adjoined_sequences.sequence_id
**pattern:** `^\w+.*?\w?$`
**cmdFlags:** `--mateName`
**description:**
@@ -186,6 +186,22 @@ genomic start position
**description:**
genomic end position

### `mate_start`
**type:** integer
**db_key:** adjoined_sequences.start
**pattern:** `^\d+?$`
**cmdFlags:** `--mateStart`
**description:**
genomic start position of fusion partner breakpoint region

### `mate_end`
**type:** integer
**db_key:** adjoined_sequences.end
**pattern:** `^\d+?$`
**cmdFlags:** `--MateEnd`
**description:**
genomic end position of fusion partner breakpoint region

### `variant_min_length`
**type:** integer
**db_key:** info.var_length
@@ -339,12 +355,14 @@ variant ids
**cmdFlags:** `--debugMode`
**description:**
debug setting
**default:** `False`

### `show_help`
**type:** boolean
**cmdFlags:** `--showHelp`
**description:**
specific help display
**default:** `False`

### `test_mode_count`
**type:** integer
@@ -370,13 +388,6 @@ For defining a special output format, mostly for `byconaut` services use. Exampl
**description:**
only used for web requests & testing

### `only_handovers`
**type:** boolean
**default:** `False`
**cmdFlags:** `--onlyHandovers`
**description:**
only used for web requests & testing

### `method`
**type:** string
**cmdFlags:** `--method`
@@ -386,34 +397,29 @@ special method

### `group_by`
**type:** string
**cmdFlags:** `-g,--groupBy`
**cmdFlags:** `--groupBy`
**description:**
group parameter e.g. for subset splitting
**default:** `text`

### `parse`
**type:** string
**cmdFlags:** `-p,--parse`
**description:**
input value to be parsed

### `mode`
**type:** string
**cmdFlags:** `-m,--mode`
**description:**
mode, e.g. file type

### `key`
**type:** string
**cmdFlags:** `-k,--key`
**description:**
some key or word

### `update`
**type:** string
**type:** boolean
**cmdFlags:** `-u,--update`
**description:**
update existing records
update existing records - might be deprecated; only used for publications
**default:** `False`

### `force`
**type:** boolean
**cmdFlags:** `--force`
**description:**
force mode, e.g. for update or insert (cmd line)
**default:** `False`

### `inputfile`
@@ -448,19 +454,13 @@ random number to limit processing, where supported
minimal number, e.g. for collations, where supported
**default:** `0`

### `source`
**type:** string
**cmdFlags:** `-s,--source`
**description:**
some source label, e.g. `analyses`

### `delivery_keys`
**type:** array
**items:**
- `type`: `string`
**cmdFlags:** `--deliveryKeys`
**description:**
delivery keys
delivery keys to force only some parameters in custom exporters

### `collation_types`
**type:** array
@@ -470,12 +470,6 @@ delivery keys
**description:**
selected collation types, e.g. "EFO"

### `with_samples`
**type:** integer
**cmdFlags:** `--withSamples`
**description:**
only for the collations; number of code_matches...

### `selected_beacons`
**type:** array
**items:**
74 changes: 74 additions & 0 deletions docs/housekeeping.md
@@ -0,0 +1,74 @@
---
title: Housekeeping
---

Recurring "housekeeping" functions are provided by dedicated scripts with
eponymous functionality located in the `housekeepers` directory (e.g. `deleteAnalyses.py`
is used for deleting records from the `analyses` collection; `deleteBiosamplesWDS.py`
deletes biosamples and their downstream records - _i.e._ the associated analyses and
the variants from those analyses). Additionally there is a separate `housekeeping.py`
app for executing a number of standard maintenance functions in sequential order.

Functions for importing and updating (for now) reside in the separate `importers` directory.

## General Options

Most housekeepers (and other) apps have some general options:

* `--testMode true` will run in a test setting, e.g. deletion apps will only indicate
the numbers of records to be deleted w/o actually removing them
    - most destructive apps will fall back to test mode by default and ask for confirmation
* `--limit 0` will perform the selected action on all records instead of a built-in
default, whereas e.g. `--limit 5` will just process a maximum of 5 records
* `--force true` will perform the selected action even if there have been warnings
or errors written to the pre-processor log file; one is usually prompted for this
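
A minimal sketch of how such shared options could be wired together. This is a
hypothetical `argparse` stand-in, not the actual `bycon` parameter machinery; the
flag names mirror the documentation above, everything else is illustrative:

```python
import argparse

def build_general_options() -> argparse.ArgumentParser:
    """Illustrative parser for the shared housekeeping options above."""
    def str2bool(v):
        return str(v).lower() in ("true", "1", "yes")
    p = argparse.ArgumentParser()
    # dry run by default: destructive apps should only report counts
    p.add_argument("--testMode", type=str2bool, default=True)
    # 0 means "no cap"; any positive value limits the number of processed records
    p.add_argument("--limit", type=int, default=0)
    # proceed despite warnings/errors in the pre-processor log
    p.add_argument("--force", type=str2bool, default=False)
    return p

args = build_general_options().parse_args(["--limit", "5", "--testMode", "false"])
```

Keeping the dangerous defaults (`testMode` on, `force` off) in one shared parser is
what lets every destructive app inherit the safe behavior described above.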

## Creating Collations - `collationsCreator.py`

The `collationsCreator` script updates the dataset specific `collations` collections
which provide the aggregated data (sample numbers, hierarchy trees etc.) for all
individual codes belonging to one of the entities defined in the `filter_definitions`
in the `bycon` configuration. The (optional) hierarchy data is provided
in `rsrc/classificationTrees/__filterType__/numbered-hierarchies.tsv` as a list
of ordered branches in the format `code | label | depth | order`.

**TBD** The filter definition should be one of the configuration where users can
provide additions and overrides in the `byconaut/local` directory.
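
Reading such a `numbered-hierarchies.tsv` file could look like this sketch; the
`code | label | depth | order` column order follows the description above, while the
helper itself is illustrative and not part of `bycon`:

```python
import csv
from io import StringIO

def read_hierarchy(tsv_text):
    """Parse ordered branches in `code | label | depth | order` format."""
    hier = {}
    for row in csv.reader(StringIO(tsv_text), delimiter="\t"):
        if len(row) != 4:
            continue  # skip malformed or empty lines
        code, label, depth, order = row
        hier[code] = {"label": label, "depth": int(depth), "order": int(order)}
    return hier

example = "NCIT:C3262\tNeoplasm\t0\t1\nNCIT:C4741\tNeoplasm by Site\t1\t2"
hier = read_hierarchy(example)
```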

### Arguments

* `-d`, `--datasetIds` ... to select the dataset (only one per run)
* `--filters` ... to (optionally) limit the processing to a subset of samples
(e.g. after a limited update)

### Use

* `bin/collationsCreator.py -d progenetix`
* `bin/collationsCreator.py -d examplez --collationTypes "PMID"`


## Pre-computing Binned CNV Frequencies - `frequencymapsCreator`

This app creates the frequency maps for the "collations" collection. Basically,
all samples matching any of the collation codes and representing CNV analyses
are selected and the frequencies of CNVs per genomic bin are aggregated. The
result contains the gain and loss frequencies for all genomic intervals, for the
given entity.
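
Conceptually, the per-bin aggregation boils down to counting, for each genomic
interval, how many analyses show a gain or a loss there and dividing by the number
of analyses. A toy sketch under that assumption (the real bins and status maps come
from `bycon`'s interval utilities, not from this helper):

```python
def bin_frequencies(status_maps):
    """status_maps: per-analysis lists of interval states ("gain"/"loss"/None).
    Returns per-bin gain and loss frequencies in percent."""
    n = len(status_maps)
    bins = len(status_maps[0])
    gains = [0] * bins
    losses = [0] * bins
    for sm in status_maps:
        for i, state in enumerate(sm):
            if state == "gain":
                gains[i] += 1
            elif state == "loss":
                losses[i] += 1
    return {
        "gain_frequencies": [100 * g / n for g in gains],
        "loss_frequencies": [100 * l / n for l in losses],
    }

freqs = bin_frequencies([
    ["gain", None, "loss"],
    ["gain", "loss", None],
])
```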

### Arguments

* `-d`, `--datasetIds` ... to select the dataset (only one per run)
* `--collationTypes` ... to (optionally) limit the processing to selected
collation types (e.g. `NCIT`, `PMID`, `icdom` ...)

### Use

* `bin/frequencymapsCreator.py -d progenetix`
* `bin/frequencymapsCreator.py -d examplez --collationTypes "icdot"`

## Deleting Records

Records are deleted by providing a standard pgx-style tab-delimited metadata file
where only the corresponding `..._id` column is essential. As example, the
`deleteIndividuals.py` app will take a table which includes a column `individual_id`
and use these values to delete the matching records.
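
The id-extraction step can be sketched as below; the column name follows the
`deleteIndividuals.py` example above, while the reader itself is illustrative:

```python
import csv
from io import StringIO

def ids_from_metadata_table(tsv_text, id_col="individual_id"):
    """Collect the id values used to select records for deletion."""
    reader = csv.DictReader(StringIO(tsv_text), delimiter="\t")
    # only the id column is essential; other columns are ignored
    return [row[id_col] for row in reader if row.get(id_col)]

table = "individual_id\tlabel\npgxind-0001\tfoo\npgxind-0002\tbar\n"
ids = ids_from_metadata_table(table)
```
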
43 changes: 26 additions & 17 deletions housekeepers/analysesStatusmapsRefresher.py
@@ -11,6 +11,7 @@
sys.path.append( services_lib_path )
from interval_utils import generate_genome_bins, interval_cnv_arrays
from collation_utils import set_collation_types
from service_helpers import ask_limit_reset

"""
@@ -27,13 +28,14 @@
################################################################################

def main():
callsets_refresher()
analyses_refresher()

################################################################################

def callsets_refresher():
def analyses_refresher():
initialize_bycon_service()
generate_genome_bins()
ask_limit_reset()

if len(BYC["BYC_DATASET_IDS"]) > 1:
print("Please give only one dataset using -d")
@@ -43,25 +45,32 @@ def callsets_refresher():
set_collation_types()
print(f'=> Using data values from {ds_id} for {BYC.get("genomic_interval_count", 0)} intervals...')

limit = BYC_PARS.get("limit", 0)
data_client = MongoClient(host=DB_MONGOHOST)
data_db = data_client[ ds_id ]
cs_coll = data_db[ "analyses" ]
v_coll = data_db[ "variants" ]

record_queries = ByconQuery().recordsQuery()
ds_results = {}
if len(record_queries["entities"].keys()) > 0:
DR = ByconDatasetResults(ds_id, record_queries)
ds_results = DR.retrieveResults()

prdbug(record_queries)

ds_results = execute_bycon_queries(ds_id, record_queries)

if not ds_results.get("analyses._id"):
if not ds_results.get("analyses.id"):
print(f'... collecting analysis id values from {ds_id} ...')
cs_ids = []
for cs in cs_coll.find( {} ):
cs_ids.append(cs["_id"])
c_i = 0
for ana in cs_coll.find( {} ):
c_i += 1
cs_ids.append(ana["id"])
if limit > 0:
if limit == c_i:
break
cs_no = len(cs_ids)
print(f'¡¡¡ Using all {cs_no} analyses from {ds_id} !!!')
print(f'¡¡¡ Using {cs_no} analyses from {ds_id} !!!')
else:
cs_ids = ds_results["analyses._id"]["target_values"]
cs_ids = ds_results["analyses.id"]["target_values"]
cs_no = len(cs_ids)

print(f'Re-generating statusmaps with {BYC["genomic_interval_count"]} intervals for {cs_no} analyses...')
@@ -74,24 +83,24 @@ def callsets_refresher():
exit()

no_cnv_type = 0
for _id in cs_ids:
for ana_id in cs_ids:

cs = cs_coll.find_one( { "_id": _id } )
csid = cs["id"]
ana = cs_coll.find_one( { "id": ana_id } )
_id = ana.get("_id")
counter += 1

bar.next()

if "SNV" in cs.get("variant_class", "CNV"):
if "SNV" in ana.get("variant_class", "CNV"):
no_cnv_type += 1
continue

# only the defined parameters will be overwritten
cs_update_obj = { "info": cs.get("info", {}) }
cs_update_obj = { "info": ana.get("info", {}) }
cs_update_obj["info"].pop("statusmaps", None)
cs_update_obj["info"].pop("cnvstatistics", None)

cs_vars = v_coll.find({ "analysis_id": csid })
cs_vars = v_coll.find({ "analysis_id": ana_id })
maps, cs_cnv_stats, cs_chro_stats = interval_cnv_arrays(cs_vars)

cs_update_obj.update({"cnv_statusmaps": maps})
17 changes: 9 additions & 8 deletions housekeepers/collationsCreator.py
@@ -46,7 +46,6 @@ def collations_creator():
collationed = coll_defs.get("collationed")
if not collationed:
continue
pre = coll_defs["namespace_prefix"]
pre_h_f = path.join( pkg_path, "rsrc", "classificationTrees", coll_type, "numbered_hierarchies.tsv" )
collection = coll_defs["scope"]
db_key = coll_defs["db_key"]
@@ -81,7 +80,7 @@ def collations_creator():
no = len(hier.keys())
matched = 0
if not BYC["TEST_MODE"]:
bar = Bar("Writing "+pre, max = no, suffix='%(percent)d%%'+" of "+str(no) )
bar = Bar("Writing "+coll_type, max = no, suffix='%(percent)d%%'+" of "+str(no) )
for count, code in enumerate(hier.keys(), start=1):
if not BYC["TEST_MODE"]:
bar.next()
@@ -99,7 +98,6 @@ def collations_creator():
else:
child_no = data_coll.count_documents( { db_key: { "$in": children } } )
if child_no > 0:
# sub_id = re.sub(pre, coll_type, code)
sub_id = code
update_obj = hier[code].copy()
update_obj.update({
@@ -127,7 +125,7 @@
if not BYC["TEST_MODE"]:
sel_hiers.append( update_obj )
else:
print(f'{sub_id}:\t{code_no} ({child_no} deep) samples - {count} / {no} {pre}')
print(f'{sub_id}:\t{code_no} ({child_no} deep) samples - {count} / {no} {coll_type}')
# UPDATE
if not BYC["TEST_MODE"]:
bar.finish()
@@ -342,12 +340,14 @@ def _get_child_ids_for_prefix(data_coll, coll_defs):

def _get_label_for_code(data_coll, coll_defs, code):

label_keys = ["label", "description"]
label_keys = ["label", "description", "note"]

db_key = coll_defs["db_key"]
id_key = re.sub(".id", "", db_key)
example = data_coll.find_one( { db_key: code } )

# prdbug(f'{db_key} - example {example}')

if id_key in example.keys():
if isinstance(example[ id_key ], list):
for o_t in example[ id_key ]:
@@ -356,14 +356,15 @@ def _get_label_for_code(data_coll, coll_defs, code):
if k in o_t:
return o_t[k]
continue
else:
elif type(example[ id_key ]) is object:
o_t = example[ id_key ]
if code in o_t["id"]:
if code in o_t.get("id", "___none___"):
for k in label_keys:
if k in o_t:
return o_t[k]

return ""

return code

################################################################################
################################################################################