Take-home Exercise 2: Applied Spatial Interaction Models: A case study of Singapore public bus commuter flows
1. Overview
What motivates city residents to rise early and travel from their homes to work each morning? How does the cancellation of a bus service affect those living along its route? These are among the numerous challenges related to urban mobility that transport providers and city planners must tackle.
Traditionally, to answer such questions, commuter surveys have been the go-to method. However, this approach is expensive, time-intensive, and laborious. Not only that, the data collected often require extensive time to process and analyze, rendering them outdated by the time a report is ready.
With the digitization of city-wide infrastructures like buses, subways, utilities, and roads, the resulting data can serve as a basis for mapping out travel patterns over time and space. This has become increasingly feasible with the widespread adoption of pervasive computing technologies, such as GPS in vehicles and SMART cards by public transport users.
Unfortunately, this explosive growth of geospatially-referenced data has far outpaced planners' ability to utilise and transform the data into insightful information, reducing the return on the investment made to collect and manage these data.
2. Objective
This exercise is motivated by two main reasons.
Firstly, there is the recognition that despite the growing availability of open data for public use, a notable gap remains in applied research demonstrating how these varied data sets can be integrated, analysed, and modelled to support policy-making decisions.
Secondly, there is a general lack of practical research to show how geospatial data science and analysis (GDSA) can be used to support decision-making.
Hence, the aim of this exercise is to conduct a case study demonstrating the potential value of GDSA in integrating publicly available data from multiple sources to build spatial interaction models that identify the factors affecting the urban mobility patterns of public bus transit.
3. The Data
In this exercise, we will analyse data from a variety of sources, as outlined in Sections 3.1 and 3.2.
3.1 Aspatial Data
The August, September and October 2023 Passenger Volume by Origin Destination Bus Stops data sets were downloaded from the LTA DataMall. Please note that an API access application has to be submitted in order to download the data. For the purpose of this assignment, only the August 2023 data set will be used.
filtered_origin: an output saved out from Take-home Exercise 1. It provides the categorised peak hour commuting flows.
HDB: This data set is the geocoded version of September 2021 HDB Property Information data from data.gov.sg. This link provides a useful step-by-step guide.
School: Prepared based on School Directory and Information dataset from data.gov.sg and by geocoding the schools’ location using Singapore Land Authority (SLA) API. Please refer to In-class Exercise 4 for the details.
3.2 Geospatial Data
BusStop dataset was downloaded from the LTA DataMall. It provides information about all the bus stops currently being serviced by buses, including the bus stop code (identifier) and location coordinates.
Master Plan 2019 Subzone Boundary (No Sea) (i.e. `MPSZ-2019`) in ESRI shapefile format was downloaded from data.gov.sg. This data provides the sub-zone boundary of URA Master Plan 2019.
Business, entertn, F&B, FinServ, Leisure&Recreation and Retails are geospatial data sets of the locations of business establishments, entertainment venues, food and beverage outlets, financial centres, leisure and recreation centres, and retail and services stores/outlets that can be used for urban mobility studies. (These data sets are provided by Prof Kam.)
Train Station Exit Layer in ESRI shapefile format was downloaded from LTA DataMall. It provides the point geometry of the train station exits.
4. Getting the Data into the R environment
4.1 Setting the R environment
In the following code chunk, p_load() of the pacman package is used to install and load the following R packages into the R environment:
sf for importing, managing, and processing geospatial data,
tidyverse for performing data science tasks such as importing, wrangling and visualising data,
tmap for creating thematic maps,
sfdep for handling geospatial data,
plotly for plotting interactive graphs,
stplanr for transport planning,
DT for displaying DataTables,
sp for classes and methods for spatial data,
reshape2 for flexibly reshaping data,
ggpubr for creating publication-quality statistical graphics,
units for maintaining unit metadata,
knitr for creating HTML tables, and
performance for computing model comparison metrics such as RMSE.
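pacman::p_load(sf, tidyverse, tmap, sfdep, plotly, stplanr, DT, sp, reshape2, ggpubr, units, knitr, performance)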
4.2 Importing the OD data
Firstly, we will import the August 2023 Passenger Volume by Origin Destination Bus Stops data set downloaded from LTA DataMall by using read_csv() of the readr package.
odbus <- read_csv("data/aspatial/origin_destination_bus_202308.csv")
A quick check of the odbus tibble data frame shows that it contains the following variables:
YEAR_MONTH: Year and Month of data collection in YYYY-MM format
DAY_TYPE: Type of the day (WEEKDAY or WEEKENDS/HOLIDAY)
TIME_PER_HOUR: Hour of the day of the passenger trip, in intervals from 0 to 23 hours
PT_TYPE: Type of public transport. As this is a bus data set, only the value BUS should exist (a quick check of this is shown after the glimpse() output below).
ORIGIN_PT_CODE: ID of Origin bus stop
DESTINATION_PT_CODE: ID of Destination bus stop
TOTAL_TRIPS: Total trips representing passenger volumes
Notice that the values in ORIGIN_PT_CODE and DESTINATION_PT_CODE are in character data type.
glimpse(odbus)
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <chr> "04168", "04168", "80119", "80119", "44069", "4406…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "1722…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
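Since this is a bus-only data set, we can also confirm that BUS is the only value present in PT_TYPE. The check below is a minimal sketch added for verification; it was not part of the original exercise output.
# Confirm that BUS is the only public transport type in the data
unique(odbus$PT_TYPE)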
The code chunk below is used to change ORIGIN_PT_CODE and DESTINATION_PT_CODE to factor data type because we want to use them for further processing, such as georeferencing with the Bus Stop Location data.
odbus$ORIGIN_PT_CODE <- as.factor(odbus$ORIGIN_PT_CODE)
odbus$DESTINATION_PT_CODE <- as.factor(odbus$DESTINATION_PT_CODE)
Notice that both of them are in factor type now.
glimpse(odbus)
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <fct> 04168, 04168, 80119, 80119, 44069, 44069, 20281, 2…
$ DESTINATION_PT_CODE <fct> 10051, 10051, 90079, 90079, 17229, 17229, 20141, 2…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
4.2.1 Extracting the OD study data
4.2.1.1 Choosing the peak hour time interval
In order to decide which peak hour time interval to use for our spatial interaction modelling, we will first examine the distribution of the commuting flows for each peak hour time interval.
The code chunk below is used to load the filtered_origin data set that was saved out from Take-home Exercise 1, which provides the commuting flows for each peak hour time interval.
# Load the filtered_origin file saved from Take-home Exercise 1
filtered_origin <- read_rds("data/rds/filtered_origin.rds")
Next, we group by origin-destination and peak hour period and summarise the total trips.
# Summarise the data to calculate the total trips per origin-destination pair and peak hour period
filtered_origin <- filtered_origin %>%
  group_by(ORIGIN_PT_CODE, DESTINATION_PT_CODE, `Peak hour period`) %>%
  summarise(TRIPS = sum(TOTAL_TRIPS))
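As a quick sanity check (not shown in the original exercise), the aggregated tibble can be previewed to confirm that each row now represents an origin-destination pair within a peak hour period.
# Preview the aggregated origin-destination flows
glimpse(filtered_origin)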
Next, we plot the distribution of the log-transformed trips across the different peak periods. A log transformation is needed because of the extreme outliers observed when the raw trips were plotted in Take-home Exercise 1.
# Create a box plot of log-transformed trips by peak hour period
peak_boxplot_log <- filtered_origin %>%
  ggplot(aes(x = `Peak hour period`, y = log(TRIPS), fill = `Peak hour period`)) + # fill colour based on peak hour period
  geom_boxplot() + # Adding the box plot
  labs(
    title = "Distribution of Log(Trips) across different peak periods",
    x = "Peak hour period",
    y = "Log(Total Trips)"
  ) +
  theme_minimal() + # Using a minimal theme for a cleaner look
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold")) # Bold title and axis labels

# Convert the ggplot object to an interactive Plotly object
peak_logtrips <- ggplotly(peak_boxplot_log, tooltip = c("y"))

# Display the interactive plot
peak_logtrips