-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathworkflow.qmd
109 lines (60 loc) · 11.5 KB
/
workflow.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
---
title: "Analytic Workflow for Research Projects"
subtitle: "A guide for data analyst in a research lab"
author: Xilin Chen
date: 08/19/2022
format:
html:
toc: true
toc-depth: 4
editor: visual
---
This document describes the general workflow for data analysis projects in the MedLab.
## Project steps overview
![](workflow_diagram.png){fig-align="center" width="337"}
### Step1: Project kick off meeting
A research project starts with a kick off meeting. We follow the general idea of the [paper print](https://www.uab.edu/ccts/images/Paper_sprints.pdf). In this meeting, the research team, including PI, champion and potential co-authors, should join and have a discussion about research design and research questions. At this stage, everything can be changed and the switching cost is the lowest. For analysts, we need to understand and clarify the research background and questions, so we can have concrete ideas of the project complexity and feasibility. For example, if the data or the important covaraites are available. Be open about different ideas, since we might need to use them later in the analysis exploration. I am usually less open or more critical about new ideas later in the analysis, since the switching cost is high.
Ideally, before the kick off meeting, the analyst should already be communicating with the project champion about research design plan. So the analyst can check the data availability and ask some clarifying questions before the kick-off meeting. This helps the analyst to be more engaging in the kick off meeting.
If the analyst is already familiar with the data or research, it's also helpful to have some descriptive data available in the kick off meeting. Just general table1 info, no need to run any models.
### Step2: Set up analysis project
#### 2.1 communication channels
Before we do analysis, it's important to set up communication channels and expectations with the project champion. We usually have two separate ways. 1. internal team members, for example our research fellows in our lab 2. external collaborators, for example, people who are not familiar or don't have access to our internal communication tools.
**internal team**
- Slack direct message: very light weight communication. Things we never need to refer back in the future. For example, I like to use slack to set up check-in meetings with the champion after each analysis iteration.
- Slack project channel: a unique slack project channel is created for each project. Questions, discussions or anything that you want the whole research project team to see can be posted here. This helps the whole research team to be involved in the project development asynchronously and avoid sending out team-wise emails.
- Asana: task specific communication. Task assignments should be documented on Asana. This is a place where champion and the analyst put the tasks we want to accomplish. Ideally, we should put the "Done" criteria in the task so there is no misunderstandings. Task related discussions are also encouraged to happen here.
- Meetings: when we need to make a decision and have deep discussion about the results and plan next steps based on the results what we currently have. Tip: always summarize the meeting in writing, so the research team can refer back. It can be posted on slack project channel.
- Email: only team-wise communication. Email is usually my last option to communicate project progress.
**external team**
external collaborations are usually small projects. I just use emails to communicate since everyone knows how to read emails. I don't like to impose our system on external collaborators.
#### 2.2 set up a project on CSTAR github
Our lab use [GitKraken](https://www.gitkraken.com/) to commit code to our [github page](https://github.com/UMCSTaR). We have a [template repo](https://github.com/UMCSTaR/repo-template) on github. This template has organized folders that are ideal for complex projects. I usually start with the template when I know the projects can have multiple iterations. When doing a simple projects that don't require multiple iterations, I just start an empty and add a few folders I see appropriate. see [example](https://github.com/UMCSTaR/SIMPL_Tenwek).
When set up projects, take a few minutes to write down the project description in the *About* section on github. This could help the team to locate the project in the future (after months and years). If working on complicated projects, invest some time to fill in the ReadME on github repo. The ReadME template is with the [template repo](https://github.com/UMCSTaR/repo-template).
All the code and markdown documents should be pushed to the github project. But be careful that don't push data to the github page.
#### 2.3 set up a data folder
Data should be saved on Maize or Dropbox. If it's medicare data, save on Maize. Other than Medicare data, we save data on UM Dropbox. Raw data and model results should not be save on github.
Use the same project name on github to name data folder, so it's easier to find the corresponding data and code in the future for the projects. Every project should have it's own data folder.
*Additional tips: make sure the github repo name, data folder name, and* Dropbox folder names *are the same. This helps find us find data and documents easier in the future.*
### Step3: Analysis Plan
The initial analysis plan can be broad and vague. The action items and plan should be developed after each analysis iterations. Having some results help us understand the projects better. If invest too much time in planning everything in the beginning, once the plan has changed (very likely going to be), the effort of the planning would be wasted. It also help us to be less attached to certain research ideas and approaches. Document the analysis plan for each iteration on Asana or Slack project channel where the team can easily have access to. Share the meeting summaries and action items on the Slack project channel, so the other team members who are not in the meeting are aware of the project development. The analysis plan generally only needs to cover one sprint. The next steps are based on the current available results. The analyst should feel free to add more beyond the plan during the analysis, but always cover the agreed upon items.
It's important to set expectations between the project champion and analyst in the analysis plan phase. The project champion should give timely feedback to the analysis results. For example, in the analysis plan meeting, I usually would check the availability of the project champion and let them know what my availability for the project. I update the results with them every week. I expect them to give me feedback of the results within 3 days. This is especially important in the beginning, because we want to make sure the analysis has met the expectations of the champion before we go too far. We should meet after each analysis iteration so we can ask questions and plan for next sprint. If the project champion can't be actively involved in the analysis, I usually won't start the analysis until they have availability to be work closely me. The switching cost is high if we have to pause and restart analysis often.
### Step4: Analysis
Generally, the first round of analysis should be cohort definition and descriptive. Then we can move to model-based analysis and visualization.
The rule of thumb is to share results early and keep in touch with the project champion frequently. Don't spend too much time in formatting or making things perfect. It's better to share the work in progress results early with the project champion and have a meeting to go through the results first. In the meeting, the project champion should give feedback and ask clarifying questions regarding the results. The analyst can share the plan for next steps. Concentrating on sharing results early allows for faster iterations and more transparent communications. I usually would update the project champion every week. Even if you didn't finish the analysis in a week, still offer to show the project champion with what you have and let them know the timeline of when you can finish.
#### 4.1: Cohort Definition Diagram
Cohort definition should be the first thing to do. Make sure to spend time to think, communicate and document the cohort selection process is important. Re-defining cohort later in the analysis process cost a lot time, effort and frustration. Cohort definition should be research theory driven instead of "results" driven. Looking at results to define cohort can lead to p-hacking.
I use [DiagrammeR](https://rich-iannone.github.io/DiagrammeR/) to create the cohort diagram in the CONSORT style. Here is an [example code](https://github.com/UMCSTaR/medicareAnalytics/blob/master/R/cohort_selection_diagram.R) I have used for medicare projects. The benefit of coding diagram is that it helps in multiple iterations scenario. No numbers are hard coded, so if we change our cohort, we don't need to copy and paste numbers which can be very tedious and error-prone.
I don't spend time making the diagram pretty, since it's not easy to do detailed plot editing using DiagrammeR. I save the diagram in svg and edited in omnigraffle.
#### 4.2: lab notebooks
Lab notebooks are used to communicate results with the project champion primarily. Notebooks don't need to be too detailed since they are still work in progress products. It's helpful to schedule meeting with the project champion to go through the results so everyone is on the same page. Also, by having meeting after the notebook results, we don't have to write down every details of the model specs and interpretations. Explanations can be done in the meeting.
I have been generating HTML documents using Rmarkdown. Consider using Quarto for now on though, since that's the future and everyone is switching. I create a lab notebook folder for each project on our lab Dropbox. So every team member can have access to the notebooks but still not publicly accessible. I don't send html documents via email since the recipients have to download the documents in order to view. It's easier to include dropbox inks in the email.
Lab notebooks can be rough in the beginning of the analysis, since the cohort and our analysis usually changes. Concentrating on faster iterations in the early stage is more important than having polished results. However, when the analyses are done, we should have good documents that the champion and the research team can refer back months down the road.
### Step5: Finish Analysis
After the analysis is done, spend sometime to organize the code so that it's easier to reference back in the future. I have used the R package [targets](https://cran.r-project.org/web/packages/targets/index.html) to manage my analytic workflow. But I also found just name the scripts using numbers are also good enough.
## Q & A
1. Where are the key articles that are used for statistical analysis?
important reference articles are saved on Zotero at *UofM MEDLab -\> Research Resources -\> Methods -\> 1. Key Articles*
2. How to use Asana to get feedback from champions and team?
The analyst and project champion should assign tasks under Asana projects. Document "Done" criteria under each tasks if it's not clear. Once the tasks are finished, tag the project champion to review the results and ask for feedback.
3. Where are the project data?
Projects that used medicare data are on maize; Other projects data are on MedLab Dropbox. Each project has its own data folder. The data folder shares the same key word of the project name on github. Note, the project name on github and data folder might be different, but they share keywords that represent the research project.