coursework-template.Rmd

---
title: "Coursework submission for Transport Data Science (TRAN5340M): resit guidance"
# subtitle: "Enter your own title here (e.g. Exploring open transport data: a study of the Isle of Wight)"
author: "Student 12345"
output: github_document
# for html output:
# output:
#   html_document:
#     number_sections: true
# for pdf output:
# bibliography: tds.bib
---

# Statement {-}

```
TRAN5340M
Transport Data Science
Assignment Title:	Add your own
Student ID: Add your ID
Word Count:	xxxx	
Lecturer: Dr Robin Lovelace
Submission Date:		
Semester:				
Academic Year: 		202425
Generative AI Category: AMBER
```


Use of Generative Artificial Intelligence (Gen AI) in this assessment. Remove statement as appropriate:

I have made no use of Gen AI

I have used Gen AI only for the specific purposes outlined in my acknowledgements


By submitting the work to which this sheet is attached you confirm your compliance with the University’s definition of Academic Integrity as: “a commitment to good study practices and shared values which ensures that my work is a true expression of my own understanding and ideas, giving credit to others where their work contributes to mine”. Double-check that your referencing and use of quotations is consistent with this commitment. 

\newpage

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, eval = TRUE)
```

# Guidance {-}

This template contains information and suggested headings for the TDS module.
It is based on an RMarkdown file. See the source code here: https://github.com/ITSLeeds/TDS/blob/master/coursework-template.Rmd .
Do not submit coursework that contains this note or any text (other than the headings) from this template.
It is just designed to get you started. You will submit the .Rmd file, optionally included in a .zip file if you want to include data, and the resulting PDF document as your coursework submission.

As outlined in the module catalogue, the coursework should be a maximum of 10 pages long, excluding references and appendices (you should include your best work and think about maximising the use of space - the chunk option `out.width="50%"`, for example, can help with this as outlined [here](https://bookdown.org/yihui/bookdown/figures.html) )

An example of a good submission can be found here (requires a University of Leeds login): https://leeds365.sharepoint.com/sites/msteams_cbf52a/Shared%20Documents/General/stats19-example.zip

The information below provides guidance on the coursework. 

<!-- ## Two pager

- Deadline for non-assessed submission of a .zip file containing a 1 or 2 page pdf document with ideas before the final submission. The document will allow you to ask questions (e.g. "does this sound like a reasonable input dataset and topic?") and describe progress on reading in input datasets and the analysis plan. The document will contain:
  - A draft title of your topic
  - The main dataset that you will use and other datasets that you could use
  - Ideas on a research question
  - Questions you would like to ask about the topic, e.g. 'is this a suitable dataset?'
  - 2 or more references to the academic literature related to the topic
  - Any preliminary analysis you have done
  - The structure of the document could include
    - Topics considered
    - Input datasets
    - Analysis plan - I suggest creating a workflow diagram for this, e.g. as presented [here](https://user-images.githubusercontent.com/1825120/127524923-7d9f5511-84a6-430b-8de9-a603a5524f39.png)
    - Motivation for choosing this topic
    - Questions and options -->

<!-- ## Final submission -->
## Resit submission

- Deadline: Friday 17th May 2024, 14:00
- Format: a PDF file (max 10 pages, max word count of 3000 excluding tables, code, references, captions) and an .Rmd or .qmd file in a .zip file containing the Rmd file and minimal dataset needed to reproduce the results if possible (40 MB max size)
- Template: You can download a template .Rmd file as the basis of your submission: https://github.com/ITSLeeds/TDS/raw/master/coursework-template.Rmd

For the coursework you will submit a pdf a document with a maximum of 10 pages that contains code and results demonstrating your transport data science skills.

You should submit you work in  .zip file containing everything needed to reproduce the results: https://github.com/ITSLeeds/TDS/releases/download/0.20.1/coursework-template.zip

## Choosing a topic and writing your coursework

You will need to choose a topic, one or more datasets to analyse and research questions for the 10 page coursework report.
**Note: the topic must be entirely different from the one you chose for the first coursework.**

Designing and writing a good data science project involves many stages, not just writing and and knitting an Rmd document reporting data analysis methods and results.
The process involves:

- Brainstorming: what kind of topics and research questions are you interested in?
- Dataset identification: what datasets are available on the topic? If there are not good datasets related to the topic you may want to rethink your topic.
- Research questions: what questions do you want to know
- Data processing: what did you do to read-in the data?
- Exploratory data analysis: visualisation, describing the data
- Modelling/communication

As in real world data science work, there are many options to choose from.
You should decide on a topic based on your personal interest and the availability of a good dataset.
You can choose from and adapt one of the following options or choose a topic of your own.

### Topics

- Data collection and analysis
  - What is the relationship between travel behaviour (e.g. as manifested in origin-destination data represented as desire lines, routes and route networks) and road traffic casualties in a transport region (e.g. London, West Midlands and other regions in the `pct::pct_regions$region_name` data)
  - Analysis of a large transport dataset, e.g. https://www.nature.com/articles/sdata201889
- Infrastructure and travel behaviour
  - What are the relationships between specific types of infrastructure and travel, e.g. between fast roads and walking?
  - How do official sources of infrastructure data (e.g. the [CID](https://github.com/PublicHealthDataGeek/CycleInfraLnd/)) compare with crowd-sourced datasets such as OpenStreetMap (which can be accessed with the new [`osmextract` R package](https://github.com/ropensci/osmextract))
  - Using new data sources to support transport planning, e.g. using data from https://telraam.net/ or https://dataforgood.facebook.com/dfg/tools/high-resolution-population-density-maps 

- Changing transport systems
  - Modelling change in transport systems, e.g. by comparing before/after data for different countries/cities, which countries had the hardest lockdowns and where have changes been longer term? - see here for open data: https://github.com/ActiveConclusion/COVID19_mobility
  - How have movement patterns changed during the Coronavirus pandemic and what impact is that likely to have long term (see [here](https://saferactive.github.io/trafficalmr/articles/report3.html) for some graphics on this)
  
- Software development
  - Creating a package to make a particular data source more accessible, see https://github.com/ropensci/stats19 and https://github.com/elipousson/crashapi examples
  - Integration between R and A/B Street - see https://github.com/a-b-street/abstr
  
- Road safety - how can we makes roads and transport systems in general safer?
  - Influence of Road Infrastructure: 
    - 1. Assessing the role of well-designed pedestrian crossings, roundabouts, and traffic calming measures in preventing road accidents. 
    - 2. Investigating the correlation between road surface quality (e.g., potholes, uneven surfaces) and the frequency of accidents.
  - Influence of Traffic Management: 
    - 1. Assessing the role of traffic lights and speed cameras in preventing road accidents. 
    - 2. Investigating the correlation between the frequency of accidents and the presence of traffic calming measures (e.g., speed bumps, chicanes, road narrowing, etc.).
  - Legislation and Enforcement:
    - 1. Assessing the role of speed limits in preventing road accidents.

- Traffic congestion - how can we reduce congestion?
  - Data Collection and Analysis:
    - 1. Utilizing real-time traffic data from platforms like Waze and Google Maps to forecast congestion patterns.
    - 2. Analyzing historical traffic data to identify recurring congestion patterns and anticipate future traffic bottlenecks.
  - Machine Learning and Predictive Modeling:
    - 1. Designing machine learning models that use past and current traffic data to predict future congestion levels.
  
- Other
  - Other topics are welcome
  
### Datasets 

You should choose the main dataset that you will use for the coursework based on the topic and the availability of datasets. If you are interested in a particular dataset that could help you decide a topic. Good datasets include:

- STATS19 road crash data (other countries have other datasets), e.g. using data from the `stats19` package: https://docs.ropensci.org/stats19/
- 'PCT' data from UK travel behaviour - see https://itsleeds.github.io/pct/
- OpenStreetMap data (global, you will need to think of a subset by area/type), e.g. from the https://docs.ropensci.org/osmextract/index.html package
- Traffic count data, e.g. from the DfT, as described here: https://github.com/ITSLeeds/dftTrafficCounts
- Open data from a single city, e.g. Seattle: https://data-seattlecitygis.opendata.arcgis.com/
- See here: https://github.com/awesomedata/awesome-public-datasets#transportation
- And here: https://github.com/CUTR-at-USF/awesome-transit
- and [here](https://github.com/ITSLeeds/opentransportdata)

### Specific coursework options

If you are struggling for ideas and example code, these resources, in addition to the links provided in the lectures and practicals can help:

If this is a resit you must choose a different topic.

<!-- 1. Work through the stats19 training vignette to sharpen your R skills: https://docs.ropensci.org/stats19/articles/stats19-training.html -->

<!-- 2. Take a look at the Model Basics chapter (and the next if interested) of the book R for Data Science: https://r4ds.had.co.nz/model-basics.html -->

<!-- 3. Try to reproduce the results presented in the ML practical: https://github.com/ITSLeeds/TDS/blob/master/practicals/9-ml.md -->

<!-- When working on the homework you should be thinking about the datasets that you want to use for your coursework assignment. Ideas for datasets that you could use are: -->

You could pick one of these topics:

- What explanatory variables best predict the level of walking in Leeds (or a different English city), and how do walking levels relate to pedestrian safety? You could use the command `pct::get_desire_lines(region = "west-yorkshire")` to get data on walking desire lines and `get_stats19()` to get road crash data for Leeds. In terms of the topics covered in the lectures you could:
  - Show understanding of data science in context with a brief introduction that touches on the definition of transport data science
  - Demonstrate understanding of data structures by converting from a data frame of road crashes to an sf object
  - Show routes, e.g. by getting route data with `pct::get_pct_routes_fast()`
  - Show your data visualisation skills by visualising the datasets
  - Demonstrate understanding of modelling with a simple model to explain why the rate of walking varies, e.g. with the distance of the trip and variables that you will calculate (e.g. distance from the city centre)

- What factors are associated with low levels of walking and cycling and high road traffic casualty rates in the UK?

- How accessible are parks and other amenities for different areas of a particular city 
  - This could be done using data from OSM and perhaps official data from local government


## Report structure

The report should have a logical structure with key headings such as:

- Introduction
- Input data and data cleaning
- Exploratory data analysis
- Discussion (e.g. strengths and weaknesses)
- Conclusion (e.g. how the results could be used and next steps)
- References

<!-- TODO: split out this bit into marking-criteria.qmd -->

## Marks

Marks are awarded in 4 categories, accounting for the following criteria:

    **Data processing: 20%**

1. The selection and effective use of input datasets that are large (e.g. covering multiple years), complex (e.g. containing multiple variables) and/or diverse (e.g. input datasets from multiple sources are used and where appropriate combined in the analysis)
1. Describe how the data was collected and implications for data quality, and outline how the input datasets were downloaded (with a reproducible example if possible), with a description that will allow others to understand the structure of the inputs and how to import them
1. Evidence of data cleaning techniques (e.g. by re-categorising variables)
1. Adding value to datasets with joins (key-based or spatial), creation of new variables (also known as feature engineering) and reshaping data (e.g. from wide to long format)

**Distinction (70%+):** The report makes use of a complex (with many columns and rows) and/or multiple input datasets, efficiently importing them and adding value by creating new variables, recategorising, changing data formats/types, and/or reshaping the data. Selected datasets are very well suited to the research questions, clearly described, with links to the source and understanding of how the datasets were generated.

**Merit (60-69%):** The report makes some use of complex or multiple input datasets. The selection, description of, cleaning or value-added to the input datasets show skill and care applied to the data processing stage but with some weaknesses. Selected datasets are appropriate for the research questions, with some description or links to the data source.

**Pass (50-59%):** There is some evidence of care and attention put into the selection, description of or cleaning of the input datasets but little value has been added. The report makes little use of complex or multiple input datasets. The datasets are not appropriate for the research questions, the datasets are not clearly described, or there are no links to the source or understanding of how the datasets were generated, but the data processing aspect of the work acceptable.

**Fail (0-49%):** The report does not make use of appropriate input datasets and contains very little or now evidence of data cleaning, adding value to the datasets or reshaping the data. While there may be some evidence of data processing, it is of poor quality and/or not appropriate for the research questions.

    **Visualization and report: 20%**

1. Creation of figures that are readable and well-described (e.g. with captions and description)
1. High quality, attractive or advanced techniques (e.g. multi-layered maps or graphs, facets or other advanced techniques)
1. Using visualisation techniques appropriate to the topic and data and interpreting the results correctly (e.g. mentioning potential confounding factors that could account for observed patterns)
1. The report is well-formatted, accessible (e.g. with legible text size and does not contain excessive code in the submitted report) and clearly communicates the data and analysis visually, with appropriate figure captions, cross-references and a consistent style

**Distinction (70%+):** The report contains high quality, attractive, advanced and meaningful visualisations that are very well-described and interpreted, showing deep understanding of how visualisation can communicate meaning contained within datasets. The report is very well-formatted, accessible and clearly communicates the data and analysis visually.

**Merit (60-69%):** The report contains good visualisations that correctly present the data and highlight key patterns. The report is has appropriate formatting.

**Pass (50-59%):** The report contains basic visualisations or are not well-described or interpreted correctly or the report is poorly formatted, not accessible or does not clearly communicate the data and analysis visually.

**Fail (0-49%):** The report is of unacceptable quality (would likely be rejected in a professional setting) and/or has poor quality and/or few visualisations, or the visualisations are inappropriate given the data and research questions.

    **Code quality, efficiency and reproducibility: 20%**

1. Code quality in the submitted source code, including using consistent style, appropriate packages, and clear comments
1. Efficiency, including pre-processing to reduce input datasets (avoiding having to share large datasets in the submission for example) and computationally efficient implementations
1. The report is fully reproducible, including generation of figures. There are links to online resources for others wanting to reproduce the analysis for another area, and links to the input data

**Distinction (70%+):** The source code underlying the report contains high quality, efficient and reproducible code that is very well-written, using consistent syntax and good style, well-commented and uses appropriate packages. The report is fully reproducible, with links to online resources for others wanting to reproduce the analysis for another area, and links to the input data.

**Merit (60-69%):** The code is readable and describes the outputs in the report but lacks quality, either in terms of comments, efficiency or reproducibility. 

**Pass (50-59%):** The source code underlying the report describes the outputs in the report but is not well-commented, not efficient or has very limited levels of reproduicibility, with few links to online resources for others wanting to reproduce the analysis for another area, and few links to the input data.

**Fail (0-49%):** The report has little to no reproducible, readable or efficient code. A report that includes limited well-described code in the main text or in associated files would be considered at the borderline between a fail and a pass. A report that includes no code would be considered a low fail under this criterion.

    **Understanding the data science process, including choice of topic and impact: 40%**

1. Topic selection, including originality, availability of datasets related to the topic and relevance to solving transport planning problems
1. Clear research question
1. Appropriate reference to the academic, policy and/or technical literature and use of the literature to inform the research question and methods
1. Use of appropriate data science methods and techniques
1. Discussion of the strengths and weaknesses of the analysis and input datasets and/or how limitations could be addressed
1. Discuss further research and/or explain the potential impacts of the work
1. The conclusions are supported by the analysis and results
1. The contents of the report fit together logically and support the aims and/or research questions of the report

**Distinction (70%+):** The report contains a clear research question, appropriate reference to the academic, policy and/or technical literature, use of appropriate data science methods and techniques, discussion of the strengths and weaknesses of the analysis and input datasets and/or how limitations could be addressed. The report discusses further research and/or explores of the potential impacts of the work. Conclusions are supported by the analysis and results, and the contents of the report fit together logically as a cohehisive whole that has a clear direction set-out by the aims and/or research questions. To get a Distinction there should also be evidence of considering the generalisability of the methods and reflections on how it could be built on by others in other areas.

**Merit (60-69%):** There is a clear research question. There is some reference to the academic, policy and/or technical literature. The report has a good structure and the results are supported by the analysis. There is some discussion of the strengths and weaknesses of the analysis and input datasets and/or how limitations could be addressed. 

**Pass (50-59%):** The report contains a valid research question but only limited references to appropriate literature or justification. There is evidence of awareness of the limitations of the results and how they inform conclusions, but these are not fully supported by the analysis. The report has a reasonable structure but does not fit together well in a cohesive whole.

**Fail (0-49%):** The report does not contain a valid research question, has no references to appropriate literature or justification, does not discuss the limitations of the results or how they inform conclusions, or the report does not have a reasonable structure.


An example report structure is shown below.

## Information about RMarkdown

This is an R Markdown file.
You can set the output by changing `output: github_document` to something different, like `output: html_document`.
You will need to submit your work as a pdf document, which can be generated by converting html output to pdf (e.g. with the `pagedown` package) or (recommended) by setting the output to `pdf_document`.
The first lines of your RMarkdown document could look something like this to ensure that the output is a PDF document and that the R code does not run (set `eval = FALSE` to not run the R code):

```
---
title: "Coursework submission for Transport Data Science (TRAN5340M)"
subtitle: "Enter your own title here (e.g. Exploring open transport data: a study of the Isle of Wight)"
output:
  pdf_document:
    number_sections: true
author: "Student 12345"
bibliography: references.bib
---

```

```{r}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, eval = TRUE)
```


See here for more info: https://rmarkdown.rstudio.com/lesson-2.html


When you open this file in RStudio and click the **Knit** button all R code chunks are run and a markdown file (.md) suitable for publishing to GitHub is generated.

To ensure the document is reproducible, you should include a code chunk that shows which packages you used.
You may need to install new packages:

```{r, message=FALSE, eval=FALSE}
# install.packages("remotes")
install.packages("osmextract")
install.packages("pct")
install.packages("stats19")
```

We load the package as follows:

```{r, message=FALSE}
library(pct)
library(sf)
library(stplanr)
library(tidyverse)
library(tmap)
```

You can add references manually or with `[@citation-key]` references linking to a .bib file like this[@lovelace_stplanr_2017].
And this [@fox_data_2018].


## Including Code

You can include R code in the document as follows:

```{r cars}
summary(cars)
```

## Including Plots

You can also embed plots, for example:

```{r pressure, echo=FALSE}
plot(pressure)
```

Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.

# Introduction

This example report explores road casualty data in the Isle of Wight.

# Datasets used

You can get zone, OD and even route data for any city in the UK with the following commands.
We got data for the Isle of Wight with the following commands:

```{r, echo=FALSE}
# old value
op = get("has_internet_via_proxy", environment(curl::has_internet))
# check for internet
np = !is.null(curl::nslookup("r-project.org", error = FALSE))
assign("has_internet_via_proxy", np, environment(curl::has_internet))
```


```{r, message=FALSE}
library(pct)
region_name = "isle-of-wight"
z = get_pct_zones(region = region_name, geography = "msoa")

od = get_od()
od_in_zones = od %>% 
  filter(geo_code1 %in% z$geo_code) %>% 
  filter(geo_code2 %in% z$geo_code) 
desire_lines = od2line(od_in_zones, z)
```


You could get data from OpenStreetMap with the `osmdata` package.

```{r, eval=FALSE}
library(osmdata)
osm_data = opq("isle of wight") %>% 
  add_osm_feature(key = "highway", value = "primary") %>% 
  osmdata_sf()
```

```{r, echo=FALSE}
# to use pre-saved data online you can do something like this:
# saveRDS(osm_data, "osm_data_highways_primary.Rds")
# piggyback::pb_upload("osm_data_highways_primary.Rds")
# piggyback::pb_download_url("osm_data_highways_primary.Rds")
u = "https://github.com/ITSLeeds/TDS/releases/download/0.20.1/osm_data_highways_primary.Rds"
osm_data = readRDS(url(u))
```

You can get large OSM datasets with `osmextract`:

```{r}
iow_highways = osmextract::oe_get("Isle of Wight", layer = "lines")
summary(as.factor(iow_highways$highway))
iow_highways2 = iow_highways %>% 
  filter(!is.na(highway)) %>% 
  filter(!str_detect(string = highway, pattern = "primary|track|resi|service|foot"))
summary(as.factor(iow_highways2$highway))
```


You could get road casualty data with the `stats19` package, as shown below.

```{r}
crashes = stats19::get_stats19(year = 2018, output_format = "sf") %>% 
  sf::st_transform(crs = sf::st_crs(z))

crashes_in_region = crashes[z, ]
tm_shape(z) +
  tm_fill("car_driver", palette = "Greys") +
  tm_shape(iow_highways2) +
  tm_lines(col = "green", lwd = 2) +
  tm_shape(osm_data$osm_lines) +
  tm_lines(col = "maxspeed", lwd = 5) +
  tm_shape(crashes_in_region) +
  tm_dots(size = 0.5, col = "red")
```

# Descriptive analysis

```{r}
plot(desire_lines)
```

# Route analysis

A next step could be route network analysis.

See https://luukvdmeer.github.io/sfnetworks/ for an approach we could use (this could be a coursework topic on its own).

```{r, echo=FALSE}
# devtools::install_github("luukvdmeer/sfnetworks")
```


```{r, echo=FALSE}
# sln = SpatialLinesNetwork(iow_highways2)
# sln_clean = sln_clean_graph(sln)
# plot(sln_clean@sl$`_ogr_geometry_`)
# centrality = igraph::edge_betweenness(sln_clean@g)
# centrality_normalised = centrality / mean(centrality)
```

```{r, echo=FALSE}
# mapview::mapview(z) +
#   mapview::mapview(sln_clean@sl, lwd = centrality_normalised * 3, zcol = "maxspeed")
```


# Additional datasets

# Policy analysis

Here you could explain how you explored answers to policy questions such as:

- how to make the roads safer?
- how to reduce congestion?
- where to build bike parking?

# Discussion

Include here limitations and ideas for further research.

# Conclusion

What are the main things we have learned from this project?

To make the code reproducible I saved the code and data as follows


```{r, eval=FALSE}
# Save outputs into zip file:
zip(zipfile = "coursework-template.zip", files = c(
  "coursework-template.Rmd",
  "tds.bib",
  "timetable.csv"
))
# piggyback::pb_upload("coursework-template.zip") # ignore this command
# piggyback::pb_upload("coursework-template.pdf") # ignore this command
```

# References

Add your references here.

Reference 1

Reference 2

...

Lovelace, R., 2021. Open source tools for geographic analysis in transport planning. J Geogr Syst. https://doi.org/10.1007/s10109-020-00342-2

Boeing, G., 2021. Spatial information and the legibility of urban form: Big data in urban morphology. International Journal of Information Management 56, 102013. https://doi.org/10.1016/j.ijinfomgt.2019.09.009