Clean overdose death dataset #5

Closed
katelyn-hucker opened this issue Nov 7, 2023 · 8 comments
@katelyn-hucker
Contributor

No description provided.

@katelyn-hucker katelyn-hucker self-assigned this Nov 10, 2023
@katelyn-hucker
Contributor Author

I am going to start working on this!

@katelyn-hucker
Contributor Author

array(['Drug poisonings (overdose) Unintentional (X40-X44)',
       'All other alcohol-induced causes',
       'All other non-drug and non-alcohol causes',
       'Drug poisonings (overdose) Suicide (X60-X64)',
       'All other drug-induced causes',
       'Drug poisonings (overdose) Undetermined (Y10-Y14)',
       'Alcohol poisonings (overdose) (X45, X65, Y15)'], dtype=object)

Here are the different categories for cause of death. I am assuming we just need to filter by the causes that say drug overdose and then look at the attached codes for each overdose category. However, do I need to keep the other categories so that we can predict the drug overdose categories that are missing?
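A minimal sketch of that filter, assuming the data is loaded into a pandas DataFrame. The column names (`County`, `Year`, `Cause`, `Deaths`) and the toy counts are hypothetical; only the category strings come from the array above.

```python
import pandas as pd

# Hypothetical column names and toy counts; only the Cause strings are from the dataset.
deaths = pd.DataFrame(
    {
        "County": ["A", "A", "B", "B"],
        "Year": [2005, 2005, 2005, 2005],
        "Cause": [
            "Drug poisonings (overdose) Unintentional (X40-X44)",
            "All other alcohol-induced causes",
            "Drug poisonings (overdose) Suicide (X60-X64)",
            "Alcohol poisonings (overdose) (X45, X65, Y15)",
        ],
        "Deaths": [12, 30, 11, 10],
    }
)

# Keep only the drug-poisoning categories (unintentional, suicide, undetermined).
is_drug_overdose = deaths["Cause"].str.startswith("Drug poisonings (overdose)")
overdose = deaths[is_drug_overdose].copy()

# Pull the ICD-10 code range out of the trailing parentheses, e.g. "X40-X44".
overdose["ICD10"] = overdose["Cause"].str.extract(r"\(([^()]*)\)$", expand=False)
```

Matching on the prefix rather than on the word "overdose" keeps the alcohol-poisoning category out of the result, since that label also contains "(overdose)".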

One method I am proposing for missing data is to take the total deaths in a county and subtract the other categories to get the remaining deaths (total deaths - other categories = remaining deaths), then do some sort of manipulation to split the remainder across the missing categories. This is just one idea; please let me know if I am on the wrong track.
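A rough sketch of the subtraction idea, under the assumption that an all-cause total per county/year is available from a separate source. The `totals` table, the column names, and the numbers are all hypothetical.

```python
import pandas as pd

# Hypothetical layout: one row per county/year/cause, NaN where the count is missing.
by_cause = pd.DataFrame(
    {
        "County": ["A", "A", "A", "B", "B"],
        "Year": [2005] * 5,
        "Cause": ["unintentional", "suicide", "other", "unintentional", "other"],
        "Deaths": [15.0, None, 100.0, None, 40.0],
    }
)

# Hypothetical all-cause totals from a separate source.
totals = pd.DataFrame(
    {"County": ["A", "B"], "Year": [2005, 2005], "TotalDeaths": [120, 45]}
)

keys = ["County", "Year"]
summary = (
    by_cause.groupby(keys)
    .agg(
        Reported=("Deaths", "sum"),  # sum skips NaN, so this is the known part
        NumMissing=("Deaths", lambda s: s.isna().sum()),
    )
    .reset_index()
    .merge(totals, on=keys, how="left")
)

# Deaths left over for the categories whose counts are missing.
summary["Remaining"] = summary["TotalDeaths"] - summary["Reported"]
```

When `NumMissing` is 1, `Remaining` is the missing count itself. When more than one category is missing, the remainder still has to be split somehow, which is the "manipulation" step left open above.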

@katelyn-hucker katelyn-hucker pinned this issue Nov 11, 2023
@katelyn-hucker
Contributor Author

Please see the most up-to-date work on this issue here

@katelyn-hucker
Contributor Author

@lisawym I looked into the questions you asked and converted everything into text files in case you want to look at them. I pushed it to my branch. I am just waiting for total population so that I can filter by larger counties and compute mortality rates, which can then be used to help fill in missing data. Find that work here: https://github.com/MIDS-at-Duke/opioid-2023-kml/tree/usVital_missing_data

Let me know if you have questions.

@lisawym
Contributor

lisawym commented Nov 27, 2023

Based on our earlier discussion in Slack and the feedback from our instructor, I think we can handle the missing values by setting a population threshold and dropping the counties below it. We can find the threshold as follows:

  1. Find all the unique counties with missing values in deaths.
  2. Find the population of those counties.
  3. Set the population threshold just above the highest population among the county/year combinations with missing values. For example, suppose county ABC has NA values in 2005, its 2005 population is 29,740, and 29,740 is the highest population among all county/year combinations with missing values. Then we can set the threshold to 30,000.
  4. Drop all counties with a population below the threshold. As a result, all rows with missing values are dropped. This does not bias us against the counties with missing values, because we are selecting observations based on a predictor's value, and all other low-population counties are dropped as well.
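The four steps could look roughly like this in pandas. This is only a sketch: the column names and toy rows are hypothetical, rounding up to the next 10,000 is an assumption taken from the 29,740 -> 30,000 example, and the filter is applied per county/year row.

```python
import pandas as pd

# Hypothetical merged table: one row per county/year with population and deaths.
df = pd.DataFrame(
    {
        "County": ["ABC", "ABC", "DEF", "GHI", "JKL"],
        "Year": [2005, 2006, 2005, 2005, 2005],
        "Population": [29_740, 29_900, 12_000, 250_000, 18_000],
        "Deaths": [None, 11.0, None, 40.0, 10.0],
    }
)

# Steps 1-2: county/year rows with missing deaths, and their populations.
missing = df[df["Deaths"].isna()]
max_missing_pop = int(missing["Population"].max())

# Step 3: round up to the next multiple of 10,000 (assumed rounding rule),
# so the threshold is always strictly above the largest "missing" population.
threshold = (max_missing_pop // 10_000 + 1) * 10_000

# Step 4: keep only rows at or above the threshold.
kept = df[df["Population"] >= threshold]
```

Note that row ABC/2006 is dropped too even though its death count is present, which is the point of step 4: the selection depends only on population, not on whether deaths are missing.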

We talked about another approach: get total death data and compute the missing values. I think it is a really good idea, but it's a little complicated, and I am not sure we have time to implement it.
Also, if you are interested in the mortality rate approach and find a way to implement it, please go with it!
@katelyn-hucker

@katelyn-hucker
Contributor Author

I agree that the other approach would work, but getting total death data for more recent years was very challenging. I browsed for about 2-3 hours yesterday.

I have already done step 1 (see text file).

I am worried that setting the population threshold like that will shrink our data considerably, especially in more rural states like WV, TX, and TN (control states). We might have to set the population threshold by STATE to properly account for states that differ in size and geography. I think this is the way to help fix the missing data, but we may still be left with some missing data between years. The last step would be to use the rates, but this needs both drug deaths and total population.
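To illustrate what a per-state threshold might look like, here is a small sketch. The column names and numbers are made up, and using the largest "missing" population within each state as that state's cutoff is just one possible rule.

```python
import pandas as pd

# Hypothetical county rows for two states of very different size.
df = pd.DataFrame(
    {
        "State": ["WV", "WV", "WV", "TX", "TX", "TX"],
        "County": ["w1", "w2", "w3", "t1", "t2", "t3"],
        "Population": [8_000, 20_000, 60_000, 90_000, 150_000, 2_000_000],
        "Deaths": [None, 12.0, 30.0, None, 25.0, 400.0],
    }
)

# Largest population among rows with missing deaths, computed separately per state.
state_threshold = df[df["Deaths"].isna()].groupby("State")["Population"].max()

# Look up each row's state threshold; states with no missing rows get 0,
# so nothing is dropped from them.
row_threshold = df["State"].map(state_threshold).fillna(0)

# Keep rows strictly above their own state's threshold.
kept = df[df["Population"] > row_threshold]
```

In this toy data a single global cutoff would be 90,000 and would remove every WV county, while the per-state cutoffs (8,000 for WV, 90,000 for TX) keep two counties in each state.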

Are you going to begin this merging step, or should I? I will submit a pull request for my branch so that someone can begin merging our two datasets, which will hopefully fix at least some of the missing data.
@lisawym

@lisawym
Contributor

lisawym commented Nov 27, 2023

Yeah, valid point: states are different in nature, and we might overlook some rural areas with a single threshold.

Would you like to merge the population data and see whether it gives us a better understanding of the death data? Hopefully we will be lucky and find a reasonable, easy way to deal with the missing data.

I noticed that there is a county code in the death dataset. I think it is related to the FIPS code. I added a leading 0 for states with a one-digit state code, but the death dataset omits the 0. Please let me know if there are any issues merging the population data!
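One way to line the two codes up before merging, as a sketch: it assumes the death dataset stores the county code as an integer and the population data stores a 5-character FIPS string. The column names (`County Code`, `FIPS`) and the counts are hypothetical.

```python
import pandas as pd

# Hypothetical column names and toy values.
deaths = pd.DataFrame({"County Code": [1001, 37063], "Deaths": [10, 25]})
population = pd.DataFrame(
    {"FIPS": ["01001", "37063"], "Population": [43_000, 250_000]}
)

# Restore the leading zero so every code is a 5-character string.
deaths["FIPS"] = deaths["County Code"].astype(str).str.zfill(5)

# Left merge keeps every death row; indicator=True adds a _merge column
# that flags rows with no matching population record.
merged = deaths.merge(
    population, on="FIPS", how="left", validate="many_to_one", indicator=True
)
unmatched = merged[merged["_merge"] == "left_only"]
```

Checking `unmatched` after the merge is a cheap way to catch codes that still do not line up. If the county code was read in as a float, it would need converting to an integer before the `astype(str)` step.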

In the meantime, I will take a look at the transaction data and see if there is a FIPS code in it. If not, I will try to add a FIPS column based on the county name, and then see how we can merge population into the transaction data to calculate transactions per capita for the other research question.
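A sketch of how that could go if the transaction data only has county names. Everything here is an assumption: the `fips_lookup` table, the `Shipments` column, and all the values are made up for illustration.

```python
import pandas as pd

# Hypothetical transaction rows identified by state and county name only.
transactions = pd.DataFrame(
    {
        "State": ["NC", "NC", "AL"],
        "County": ["DURHAM", "DURHAM", "AUTAUGA"],
        "Year": [2010, 2010, 2010],
        "Shipments": [120.0, 80.0, 50.0],
    }
)

# Hypothetical name-to-FIPS lookup and population tables.
fips_lookup = pd.DataFrame(
    {"State": ["NC", "AL"], "County": ["Durham", "Autauga"], "FIPS": ["37063", "01001"]}
)
population = pd.DataFrame(
    {"FIPS": ["37063", "01001"], "Year": [2010, 2010], "Population": [200, 100]}
)

# Normalize county names so "DURHAM" and "Durham" match.
for frame in (transactions, fips_lookup):
    frame["county_key"] = frame["County"].str.upper().str.strip()

# Attach FIPS by state + normalized county name.
with_fips = transactions.merge(
    fips_lookup[["State", "county_key", "FIPS"]],
    on=["State", "county_key"],
    how="left",
)

# Total per county/year, then divide by population.
per_county = with_fips.groupby(["FIPS", "Year"], as_index=False)["Shipments"].sum()
per_capita = per_county.merge(population, on=["FIPS", "Year"], how="left")
per_capita["ShipmentsPerCapita"] = per_capita["Shipments"] / per_capita["Population"]
```

Matching on state plus county name matters because the same county name appears in more than one state.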

Thanks! @katelyn-hucker

@katelyn-hucker
Contributor Author

Good idea on how to split this up. I will try to merge population with the US vital stats. I'm going to submit a pull request for my branch first.

@katelyn-hucker katelyn-hucker unpinned this issue Dec 13, 2023