This repo is designed for interviews for the data engineering role at Eleanor Health.
This repo requires the following to be installed on your machine:
- Docker (Easiest way to get started)
- Python 3.9+
- Pipenv
Run the following commands to install the dependencies:
python -m pip install pipenv
pipenv install --dev
Run the following command to execute the python script:
pipenv run python main.py ./sample/memberlist.csv
Run the following command to build the docker image:
docker build -t data-eng-exercise .
Run the following command to execute the docker image:
docker run -it --rm -v ${PWD}/sample:/app/sample data-eng-exercise ./sample/memberlist.csv
Each task should be completed in the main.py file. During the interview, you will be asked to walk through
your solution and explain your thought process. You will also be allowed to use the internet to look up
documentation and other resources. If you have any questions, please ask your interviewer; they are there
to help you complete the tasks. This interview should be conducted as if it were a pair programming exercise.
While the interviewer will not solve the problems for you, they can guide you towards a solution or coach you
through a problem if you are stuck.
The tasks are as follows:
Normalize the GENDER column in the memberlist.csv file to be either M, F, or U (unknown).
- All values should be capitalized
- All values should be trimmed of whitespace
- All values should be one of the following: M, F, U
- If the value is not one of the above, it should be set to U
- Continue processing the file
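The rules above can be sketched as a small helper. This is only an illustration, not the expected solution: the function name `normalize_gender` is hypothetical, and it assumes that anything other than M, F, or U after trimming and upper-casing (including spelled-out values such as "Female") falls back to U, as the list states.

```python
def normalize_gender(value):
    """Return "M", "F", or "U" for a raw GENDER value (hypothetical helper)."""
    # Assumption: None or any non-string input counts as unknown.
    if not isinstance(value, str):
        return "U"
    # Trim whitespace and capitalize before comparing.
    cleaned = value.strip().upper()
    # Anything outside the allowed set becomes "U".
    return cleaned if cleaned in {"M", "F", "U"} else "U"
```

Because the helper never raises on unexpected input, a caller can apply it to every row and keep processing the file.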
Log a warning on missing required columns for city and state.
- If the CITY or STATE column is missing, log a warning with the following message:
  Missing value in required column: CITY
  Missing value in required column: STATE
- Use the logger object to log the warning
- The logger object is already configured for you
- Continue processing the file
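One way to approach this check is sketched below. It is an illustration under assumptions: each record is taken to be a dict keyed by column name, a blank or whitespace-only value is treated as missing, and the `logger` created here is only a stand-in for the preconfigured logger object in the repo. The helper name `warn_on_missing` is hypothetical.

```python
import logging

# Stand-in for the repo's preconfigured logger (assumption for this sketch).
logger = logging.getLogger(__name__)

REQUIRED_COLUMNS = ("CITY", "STATE")


def warn_on_missing(record, log=logger):
    """Log one warning per required column that is absent or blank.

    Returns the list of missing column names so the caller can decide
    what to do next; it never raises, so processing can continue.
    """
    missing = []
    for column in REQUIRED_COLUMNS:
        value = record.get(column)
        if value is None or not str(value).strip():
            log.warning("Missing value in required column: %s", column)
            missing.append(column)
    return missing
```

Returning the missing column names is a design choice for this sketch; it lets a later step reuse the result without repeating the check.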
Use zippy to query zipcodebase for the missing city and state columns on records that are without one and add the missing data to the record.
- If the CITY or STATE column is missing, query zippy for the missing data
- Add the missing data to the record
- Continue processing the file
- Continue processing the file
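Since Zippy accepts a list of zip codes, it may help to first collect the distinct zip codes of the records that need a lookup. The sketch below assumes each record is a dict of strings (or None) and that the zip code lives in a column named `ZIP`; that column name and the helper name `zip_codes_to_look_up` are hypothetical, so adjust them to the actual file.

```python
def zip_codes_to_look_up(records, zip_column="ZIP"):
    """Collect distinct zip codes of records missing CITY or STATE.

    The "ZIP" column name is an assumption for this sketch.
    """
    zip_codes = []
    for record in records:
        zip_code = (record.get(zip_column) or "").strip()
        city = (record.get("CITY") or "").strip()
        state = (record.get("STATE") or "").strip()
        needs_lookup = not city or not state
        # Skip records without a zip code and avoid duplicate lookups.
        if zip_code and needs_lookup and zip_code not in zip_codes:
            zip_codes.append(zip_code)
    return zip_codes
```

Collecting the zip codes up front allows a single request for many records instead of one request per record.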
Zippy is a test service designed to mimic the functionality of the zipcodebase API. It is a simple REST endpoint that returns a JSON response with the city and state for each zip code in a given list.
The service is available to you, and you will be given the URL and API key to use it.
curl -X POST -H "Content-Type: application/json" "https://{ZIPPY_URL}/locations?tempKey={API_KEY}&zipCodes=28202,90210"
Or, in a web browser, go to the following URL: https://{ZIPPY_URL}/locations?tempKey={API_KEY}&zipCodes=28202,90210
Example response:
[
  {
    "city": "Charlotte",
    "state": "North Carolina",
    "stateCode": "NC",
    "zipCode": "28202",
    "country": "US"
  },
  {
    "city": "Beverly Hills",
    "state": "California",
    "stateCode": "CA",
    "zipCode": "90210",
    "country": "US"
  }
]
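Given the URL format and response shape shown above, building the request URL and applying a parsed response to a record might look like the following sketch. Several details are assumptions: the function names are hypothetical, the zip code column is assumed to be `ZIP`, and STATE is filled from `stateCode` rather than the full state name. The HTTP call itself is left out; `locations` is the already-parsed JSON list.

```python
from urllib.parse import urlencode


def build_locations_url(base_url, api_key, zip_codes):
    """Build a /locations URL in the format shown above."""
    # safe="," keeps the comma-separated zip code list readable.
    query = urlencode(
        {"tempKey": api_key, "zipCodes": ",".join(zip_codes)}, safe=","
    )
    return f"{base_url}/locations?{query}"


def fill_missing_location(record, locations, zip_column="ZIP"):
    """Fill blank CITY/STATE on one record from a parsed response list."""
    by_zip = {location["zipCode"]: location for location in locations}
    match = by_zip.get((record.get(zip_column) or "").strip())
    if match is None:
        return record  # no data for this zip code; leave the record as-is
    if not (record.get("CITY") or "").strip():
        record["CITY"] = match["city"]
    if not (record.get("STATE") or "").strip():
        # Assumption: STATE holds the two-letter code, not the full name.
        record["STATE"] = match["stateCode"]
    return record
```

Values that are already present are left untouched, so only the gaps are filled from the response.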
All code should follow PEP 8 standards and be formatted with Black. This can be done by running the following command:
pipenv run black .
Other tools that are used to ensure code quality are: