From fc36f009d59fb3c83309cab380d0225b59d19893 Mon Sep 17 00:00:00 2001
From: yutaro-tanaka-yt2705
Date: Wed, 3 Apr 2024 17:35:21 -0400
Subject: [PATCH] small description changes

---
 README.md                      |  2 +-
 onsides_intl/LABELDATA_INTL.md | 68 ----------------------------------
 onsides_intl/ONSIDES_INTL.md   |  2 +-
 3 files changed, 2 insertions(+), 70 deletions(-)
 delete mode 100644 onsides_intl/LABELDATA_INTL.md

diff --git a/README.md b/README.md
index 1db3006..7c30149 100644
--- a/README.md
+++ b/README.md
@@ -104,7 +104,7 @@ See [DATABASE](DATABASE.md) for a step-by-step walkthrough.

 ## Limitations

-OnSIDES is **strictly** intended for research purposes. The adverse drug event term extraction method is far from perfect - some side effects will be missed and some predicted as true adverse events will be incorrect.
+OnSIDES is **strictly** intended for academic research purposes. The adverse drug event term extraction method is far from perfect - some side effects will be missed and some predicted as true adverse events will be incorrect.

 **Patients/healthcare professionals seeking health information should not trust or use this data, and instead refer to the information available from their regions' respective drug regulatory agencies, such as the [FDA](https://www.fda.gov/) (USA), [EMA](https://www.ema.europa.eu/en) (EU), [MHRA](https://www.gov.uk/government/organisations/medicines-and-healthcare-products-regulatory-agency) (UK), [PMDA](https://www.pmda.go.jp/english/) (Japan) and consult their healthcare providers for information.**

diff --git a/onsides_intl/LABELDATA_INTL.md b/onsides_intl/LABELDATA_INTL.md
deleted file mode 100644
index e11c2b3..0000000
--- a/onsides_intl/LABELDATA_INTL.md
+++ /dev/null
@@ -1,68 +0,0 @@
-## Raw Drug Label Datasets
-
-To make further language model training / downstream analysis of drug label text more accessible (not limited to ADE analysis), we have formatted all of the text in drug labels publicly available for the UK, EU, and Japan as CSV and per-drug-label XML files. Each nation has a standardized drug label format, which we mirror in the XML files as much as possible, but make some alterations for ease of use in text mining, as described below. The standardized format for each nation / region is described briefly in the [LABELSCHEMA_INTL.md](LABELSCHEMA_INTL.md).
-
-
-The files can be downloaded from [here](.tbd).
-
----
-
-### XML Files
-
-The XML files are presented per-individual drug, and the structure is kept as consistent as possible. The structure follows
-
-```
-
-
- ...
- ...
- ...
- ...
- ...
- ...
-
-
- ..text..
- ...
-
-
-```
-----
-### CSV Files
-
-There are two CSV files per nation / region, which contain all of the information for all of the drugs | `drug_info.csv` and `drug_content.csv`.
-
-#### `drug_info.csv`
-
-each individual drug label is a row.
-
-| Column | Description |
-| --| ----------|
-| product_id | id of the drug label. |
-| drug_name | name of drug product. |
-| ingredients | comma-separated list of ingredients in drug. |
-| about | regulatory classification of drug. |
-| last_updated | date of last update to drug label. |
-| company_title | marketing / manufacturer of the drug. |
-| contact_items | a dictionary of contact information related to the marketing / manufacturer of the drug. |
-
-
-#### `drug_content.csv`
-
-each section of a drug label is a row.
-
-| Column | Description |
-| -- | ----------|
-| product_id | id of the drug label.
-| section_id | id of the section. (used in XML files)
-| section_title | title of the section.
-| section_content | all content in the section. here, it is kept in html format to preserve any formatting. simple code such as ```BeautifulSoup(x).text``` can be used to extract just the text from the section. |
-
----
-### Generating these Files
-
-If one wants to generate/adapt these files for each nation/region, the code to change the obtained raw data to XML/CSV files is located in each subfolder. They are currently provided as notebooks, but they may be adapted into scripts at a later date.
-
-- [EU : 1-1.data_to_xmlcsv.ipynb](./onsides_eu/notebooks/1-1.data_to_xmlcsv.ipynb)
-- [UK : 1-1.data_to_xmlcsv.ipynb](./onsides_uk/notebooks/1-1.data_to_xmlcsv.ipynb)
-- [JP : 1-1.data_to_xmlcsv.ipynb](./onsides_jp/notebooks/1-1.data_to_xmlcsv.ipynb)
\ No newline at end of file
diff --git a/onsides_intl/ONSIDES_INTL.md b/onsides_intl/ONSIDES_INTL.md
index e0cb4ff..7343de1 100644
--- a/onsides_intl/ONSIDES_INTL.md
+++ b/onsides_intl/ONSIDES_INTL.md
@@ -1,6 +1,6 @@
 # OnSIDES-INTL

-Here, we generate databases mirroring the OnSIDES database (which extracts ADE data from US FDA SPL drug labels) from UK (EMC), EU (EMA), and Japan (PMDA) drug labels. We have also generated uniformly processed drug label text data from these drug labels that can be used as raw, structured data to train a myriad of machine learning models.
+Here, we generate databases mirroring the OnSIDES database (which extracts ADE data from US FDA SPL drug labels) from UK (EMC), EU (EMA), and Japan (PMDA) drug labels. While we follow a similar ADE extraction/prediction philosophy to OnSIDES, the raw labels are formatted slightly differently in each nation/region, so the technical workflow is adjusted slightly for each one. The detailed methods are described in [DATABASE_INTL](./DATABASE_INTL.md).