Commit

simplify the query to get all patients that have clinical data
fedorov committed May 22, 2023
1 parent b3dbbd3 commit 609b4d1
Showing 1 changed file with 16 additions and 24 deletions.
40 changes: 16 additions & 24 deletions notebooks/clinical_data_intro.ipynb
@@ -30,7 +30,7 @@
 "\n",
 "Prepared: July 2022\n",
 "\n",
-"Updated: Dec 2022"
+"Updated: May 2023"
 ]
 },
 {
@@ -76,7 +76,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 2,
 "metadata": {
 "id": "o8WdiIiBQwav"
 },
Expand Down Expand Up @@ -712,7 +712,9 @@
"\n",
"Sometime you may want to know whether specific patient has any clinical data available. One way to do this is to locate the collection that patient belongs to, and then check whether any of the clinical data tables (if any) that are available for that collection have that patient identifier.\n",
"\n",
"Alternatively, we can build a complete list of patients that have clinical data by performing a union on all of the `dicom_patient_id` columns across all of the clinical data tables, which is what we do in the next cell."
"Alternatively, we can build a complete list of patients that have clinical data by performing a union on all of the `dicom_patient_id` columns across all of the clinical data tables, which is what we do in the next cell.\n",
"\n",
"In this query we use the ability of BigQuery to [query multiple tables using a wildcard table](https://cloud.google.com/bigquery/docs/querying-wildcard-tables). Note that here we refer to the specific version of the data, since `idc_current_clinical` dataset contains views, which cannot be queried through prefix."
]
},
{
@@ -725,30 +727,20 @@
 "source": [
 "import re\n",
 "\n",
-"all_clinical_tables = column_metadata_df[\"table_name\"].unique()\n",
-"query = \"with patients_unionized as (SELECT dicom_patient_id FROM \"+re.sub(\"idc_v[0-9]*_clinical\", \"idc_current_clinical\", all_clinical_tables[0])\n",
-"for clinical_table in all_clinical_tables[1:]:\n",
-" query = query+\" UNION ALL SELECT dicom_patient_id FROM \"+re.sub(\"idc_v[0-9]*_clinical\", \"idc_current_clinical\", clinical_table)\n",
-"\n",
-"selection_query = query+\") select distinct(dicom_patient_id) from patients_unionized\"\n",
-"\n",
-"#print(selection_query)\n",
+"selection_query = \"\"\"\n",
+"SELECT\n",
+" DISTINCT(dicom_patient_id)\n",
+"FROM\n",
+" `bigquery-public-data.idc_v14_clinical.*`\n",
+"WHERE\n",
+" _TABLE_SUFFIX NOT IN (\"table_metadata\",\n",
+" \"column_metadata\" )\n",
+"\"\"\"\n",
 "\n",
 "selection_result = bq_client.query(selection_query)\n",
-"patients_df = selection_result.result().to_dataframe()\n"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"id": "Ca63J0HWiXjH"
-},
-"outputs": [],
-"source": [
-"patients = patients_df[\"dicom_patient_id\"].unique().tolist()\n",
+"patients_df = selection_result.result().to_dataframe()\n",
 "\n",
-"print(\"\\n\".join(patients))"
+"patients_df\n"
 ]
 },
 {
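The new cell hard-codes `idc_v14_clinical` and the two metadata table names in the query text. As a minimal sketch, assuming one wanted to reuse the same wildcard query for another versioned dataset, the string could be built by a small helper. The name `build_patient_query` and its parameters are hypothetical and not part of the notebook; the generated SQL mirrors the query added in this commit.

```python
from typing import Sequence


def build_patient_query(dataset: str, excluded_tables: Sequence[str]) -> str:
    """Build a wildcard-table query listing distinct dicom_patient_id values.

    Hypothetical helper for illustration; it only assembles the SQL text
    and does not contact BigQuery.
    """
    if not excluded_tables:
        # An empty "NOT IN ()" list would not be valid SQL, so require a name.
        raise ValueError("excluded_tables must not be empty")
    # Quote each table name for use inside the NOT IN (...) list.
    excluded = ", ".join(f'"{name}"' for name in excluded_tables)
    return (
        "SELECT DISTINCT dicom_patient_id\n"
        f"FROM `bigquery-public-data.{dataset}.*`\n"
        f"WHERE _TABLE_SUFFIX NOT IN ({excluded})"
    )


# A versioned dataset is used because, per the notebook text,
# idc_current_clinical contains views, which wildcard queries cannot match.
query = build_patient_query("idc_v14_clinical", ["table_metadata", "column_metadata"])
print(query)
```

The resulting string can be passed to `bq_client.query(...)` in the same way as `selection_query` in the changed cell.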
