::p_load(sf,tidyverse,tmap,sfdep,plotly,stplanr,DT,sp,reshape2,ggpubr,units,knitr,performance) pacman
Take-home Exercise 2: Applied Spatial Interaction Models: A case study of Singapore public bus commuter flows
1. Overview
What motivates city residents to rise early and travel from their homes to work each morning? How does the cancellation of a bus service affect those living along its route? These are among the numerous challenges related to urban mobility that transport providers and city planners must tackle.
Traditionally, to answer such questions, commuter surveys have been the go-to method. However, this approach is expensive, time-intensive, and laborious. Not only that, the data collected often require extensive time to process and analyze, rendering them outdated by the time a report is ready.
With the digitization of city-wide infrastructures like buses, subways, utilities, and roads, the resulting data can serve as a basis for mapping out travel patterns over time and space. This has become increasingly feasible with the widespread adoption of pervasive computing technologies, such as GPS in vehicles and SMART cards by public transport users.
Unfortunately, this explosive growth of geospatially-referenced data has far outpaced the planner’s ability to utilize and transform the data into insightful information, thus creating an adverse impact on the return on the investment made to collect and manage these data.
2. Objective
This exercise is motivated by two main reasons.
Firstly,the recognition that despite the growing availability of open data for public use, there remains a notable gap in applied research demonstrating the integration, analysis, and modeling of these varied data sets to aid in policy making decisions.
Secondly, there is a general lack of practical research to show how geospatial data science and analysis (GDSA) can be used to support decision-making.
Hence, the aim for this exercise is to conduct a case study to demonstrate the potential value of GDSA to integrate publicly available data from multiple sources for building a spatial interaction models to determine factors affecting urban mobility patterns of public bus transit.
3. The Data
In this exercise, we will analyse the various data from various sources, as outlined in sections 3.1 and 3.2.
3.1 Aspatial Data
August, September and October 2023 Passenger Volume by Origin Destination Bus Stops data set were downloaded from the LTA DataMall. Please note that an API access application has to be submitted in order to download the dataset. For the purpose of this assignment, only August 2023 dataset will be used.
filtered_origin: an output saved out from Take-home Exercise 1. It provides the categorised peak hour commuting flows.
HDB: This data set is the geocoded version of September 2021 HDB Property Information data from data.gov.sg. This link provides a useful step-by-step guide.
School: Prepared based on School Directory and Information dataset from data.gov.sg and by geocoding the schools’ location using Singapore Land Authority (SLA) API. Please refer to In-class Exercise 4 for the details.
3.2 Geospatial Data
BusStop dataset was downloaded from the LTA DataMall. It provides information about all the bus stops currently being serviced by buses, including the bus stop code (identifier) and location coordinates.
Master Plan 2019 Subzone Boundary (No Sea) (i.e. `MPSZ-2019`) in ESRI shapefile format was downloaded from data.gov.sg. This data provides the sub-zone boundary of URA Master Plan 2019.
Business, entertn, F&B, FinServ, Leisure&Recreation and Retails are geospatial data sets of the locations of business establishments, entertainments, food and beverage outlets, financial centres, leisure and recreation centres, retail and services stores/outlets that can used for urban mobility study. (These datasets are provided by Prof Kam)
Train Station Exit Layer in ESRI shapfile format was downloaded from LTA DataMall. It provides the point geometry of the train station exits.
4. Getting the Data into R environment
4.1 Setting the R environment
In the following code chunk, p_load()
from pacman package is used to install and load the following R packages into the R environment:
sf
for importing, managing, and processing geospatial data,tidyverse
for performing data science tasks such as importing, wrangling and visualising data,tmap
for creating thematic maps,sfdep
for handling geospatial data, andplotly
for plotting interactive graphsstplanr
for transport planningDT
for displaying DataTablessp
for classes and methods for spatial datareshape2
for flexibly reshaping dataggpubr
for creating publication quality statistical graphicsunits
provides a class for maintaining unit metadataknitr
for creating html tableperformance
for computing model comparison matrices such as rmse
4.2 Importing the OD data
Firstly, we will import the August 2023 Passenger Volume by Origin Destination Bus Stops data set downloaded from LTA DataMall by using read_csv()
of readr package.
<- read_csv("data/aspatial/origin_destination_bus_202308.csv") odbus
A quick check of odbus tibble data frame shows that it contains the following variables:
YEAR_MONTH: Year and Month of data collection in YYYY-MM format
DAY_TYPE: Type of the day (WEEKDAY or WEEKENDS/HOLIDAY)
TIME_PER_HOUR: Hour of the day of the passenger trip, in intervals from 0 to 23 hours
PT_TYPE: Type of public transport. As this is a bus data sets, only bus value should exist.
ORIGIN_PT_CODE: ID of Origin bus stop
DESTINATION_PT_CODE: ID of Destination bus stop
TOTAL_TRIPS: Total trips representing passenger volumes
Notice that that the values in ORIGIN_PT_CODE and DESTINATON_PT_CODE are in numeric data type.
glimpse(odbus)
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <chr> "04168", "04168", "80119", "80119", "44069", "4406…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "1722…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
The code chunk below is used to change the ORIGIN_PT_CODE and DESTINATION_PT_CODE to factor data type because we want to use them for further processing such as georeference with Bus Stop Location data.
$ORIGIN_PT_CODE <- as.factor(odbus$ORIGIN_PT_CODE)
odbus$DESTINATION_PT_CODE <- as.factor(odbus$DESTINATION_PT_CODE) odbus
Notice that both of them are in factor type now.
glimpse(odbus)
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <fct> 04168, 04168, 80119, 80119, 44069, 44069, 20281, 2…
$ DESTINATION_PT_CODE <fct> 10051, 10051, 90079, 90079, 17229, 17229, 20141, 2…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
4.2.1 Extracting the OD study data
4.2.1.1 Choosing the peak hour time interval
In order to decide which month to do for our spatial interaction modelling, we will first check the distribution of the commuting flows for each peak hour time interval.
The code chunk below is used to load the odbus_grouped dataset that was saved out from Take-Home Exercise 1, which shows the commuting flow for each peak hour time interval.
#Load the odbus_grouped file
<- read_rds("data/rds/filtered_origin.rds") filtered_origin
Next, we group by origin-destination and peak hour period and summarise the total trips.
#Summarize the data to calculate the total trips per time interval
<- filtered_origin %>%
filtered_origin group_by(ORIGIN_PT_CODE, DESTINATION_PT_CODE, `Peak hour period`) %>%
summarise(TRIPS = sum(TOTAL_TRIPS))
Next, plot the log distribution of the trips over the different peak periods. Log transformation is needed due to the presence of extreme outliers when plotted (based on Take home Exercise 1).
Show the code
# Create a box plot
<- filtered_origin %>%
peak_boxplot_log ggplot(aes(x = `Peak hour period`, y = log(TRIPS), fill = `Peak hour period`)) + # fill color based on TIME_PER_HOUR
geom_boxplot() + # Adding the box plot
labs(
title = "Distribution of Log(Trips) across different peak periods",
x = "`Peak hour period`",
y = "Log(Total Trips)"
+
) theme_minimal() + # Using a minimal theme for a cleaner look
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold")) # Bold title and axis labels
# Convert the ggplot object to an interactive Plotly object
<- ggplotly(peak_boxplot_log, tooltip = c("y"))
peak_logtrips
# Display the interactive plot
peak_logtrips
From the above log-transformed box plot, we can see that the number of trips during the weekday peaks are generally higher than the weekend/holiday peaks. Weekday peaks could thus be further analysed to determine the factors that could potentially cause this higher passenger volume flow.
4.2.1.2 Extract Weekday Afternoon Peak OD Data
For the purpose of this exercise, we shall focus on the Weekday afternoon peak (5pm to 8pm). The code chunk below is used to extract the commuting flows on weekday and between 5 and 8 o’clock.
<- odbus %>%
odbus5_8 filter(DAY_TYPE == "WEEKDAY") %>%
filter(TIME_PER_HOUR >= 17 &
<= 20) %>%
TIME_PER_HOUR group_by(ORIGIN_PT_CODE,
%>%
DESTINATION_PT_CODE) summarise(TRIPS = sum(TOTAL_TRIPS))
Table below shows the content of odbus5_8
datatable(odbus5_8)
We will save the output in rds format for future use.
write_rds(odbus5_8, "data/rds/odbus5_8.rds")
4.2.2 Visualising the Weekday peak hour commuting flows
Show the code
#Summarize the data to calculate the total trips per time interval
<- odbus %>%
odbus5_8_time filter(DAY_TYPE == "WEEKDAY") %>%
filter(TIME_PER_HOUR >= 17 &
<= 20) %>%
TIME_PER_HOUR group_by(TIME_PER_HOUR) %>%
summarise(TRIPS = sum(TOTAL_TRIPS))
# Create the ggplot with bold title and axis labels
<- ggplot(odbus5_8_time, aes(x = TIME_PER_HOUR, y = TRIPS, fill =factor(TIME_PER_HOUR)))+
p geom_bar(stat = "identity", position = position_dodge()) +
geom_text(aes(label = TRIPS)) + # Add labels
theme_minimal() +
labs(title = "Total Trips per Time Interval",
x = "TIME_OF_DAY",
y = "TOTAL_TRIPS",
fill = "TIME OF DAY") +
theme(axis.text.x = element_text(),
plot.title = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold")) # Bold title and axis labels
# Convert the ggplot object to an interactive Plotly object
# Tooltip will display the data for each bar (total trips)
<- ggplotly(p, tooltip = c("y"))
dist_trips
# Display the interactive plot
dist_trips
From the plot above, we can see that 6pm has the highest bus commuter flow, which is not surprising, as this is the typical end-of-work time for most working population and it can also be attributed to students ending their extracurricular activities and/or extra lessons. We will further explore the commuting flows between the origin and destination busstop to determine the propulsiveness and attractiveness for their spatial interaction.
4.3 Importing Geospatial Data
The code chunk below uses st_read()
function of sf package to import BusStop and MPSZ-2019 shapefile into R as a simple feature data frame called bustop and mpsz respectively. As BusStop uses svy21 projected coordinate system, the crs
is set to 3414.
<- st_read(dsn = "data/geospatial",
busstop layer = "BusStop") %>%
st_transform(crs = 3414)
Reading layer `BusStop' from data source
`C:\Users\Dabbie Neo\Desktop\Geospatial_Analytics\dneoo\ISSS624\Take-Home_Ex\Take-Home_Ex02\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
<- st_read(dsn = "data/geospatial",
mpsz layer = "MPSZ-2019") %>%
st_transform(crs = 3414)
Reading layer `MPSZ-2019' from data source
`C:\Users\Dabbie Neo\Desktop\Geospatial_Analytics\dneoo\ISSS624\Take-Home_Ex\Take-Home_Ex02\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS: WGS 84
Note that when the input geospatial data is in shapefile format, two arguments will be used, namely:
dsn
to define the data path andlayer
to provide the shapefile name. Also note that no extension such as .shp, .dbf, .prj and .shx are needed.st_read()
function of sf package is used to import the shapefile into R as sf data frame.st_transform()
function of sf package is used to transform the projection to crs 3414.
4.3.1 Geospatial data wrangling
4.3.1.1 Check number of unique busstops
# Count the number of unique BUS_STOP_N values
<- n_distinct(busstop$BUS_STOP_N)
num_unique_bus_stops
# Print the number of unique bus stops
print(num_unique_bus_stops)
[1] 5145
There are a total of 5145 unique bus stops.
4.3.1.2 Creating the hexagonal grids for busstops
st_make_grid()
function will be used to create the hexagonal grid. The purpose for changing the point geometry to hexagonal grids is to normalize geography for mapping and to mitigate the issues of using irregularly shaped polygons created arbitrarily (such as the boundaries) . For more information, please refer to the link here.
1. MPSZ
mpsz
hexagonal grids will be used as a base layer, in order to display the Singapore boundary.
Show the code
= st_make_grid(mpsz, c(750, 750), what = "polygons", square = FALSE)
area_honeycomb_grid
# To sf and add grid ID
= st_sf(area_honeycomb_grid) %>%
honeycomb_grid_sf # add grid ID
mutate(grid_id = 1:length(lengths(area_honeycomb_grid)))
# count number of points in each grid
# https://gis.stackexchange.com/questions/323698/counting-points-in-polygons-with-sf-package-of-r
$n_colli = lengths(st_intersects(honeycomb_grid_sf, mpsz))
honeycomb_grid_sf
# remove grid without value of 0 (i.e. no points in side that grid)
= filter(honeycomb_grid_sf, n_colli > 0) # only display those that have busstops in the grid honeycomb_count
2. Busstops
= st_make_grid(busstop, c(750, 750), what = "polygons", square = FALSE)
area_bus_grid
# To sf and add grid ID
= st_sf(area_bus_grid) %>%
bus_grid_sf # add grid ID
mutate(grid_busstop_id = 1:length(lengths(area_bus_grid)))
# count number of points in each grid
# https://gis.stackexchange.com/questions/323698/counting-points-in-polygons-with-sf-package-of-r
$n_colli_busstop = lengths(st_intersects(bus_grid_sf, busstop))
bus_grid_sf
# remove grid without value of 0 (i.e. no points in side that grid)
= filter(bus_grid_sf, n_colli_busstop > 0) # only display those that have busstops in the grid bus_count
cellsize is the hexagonal cells the distance between opposite edges. Given that the perpendicular distance between the center of the hexagon and its edges is 375m, to get the cellsize we multiply by 2, which equals to 750m. Studies have indicated that an individual is generally only willing to walk a maximum distance of 750m to the busstops.
grid_id is used to identify the busstop number in each hexagon grid
n_colli is calculated to find the number of busstops present in each hexagon grid.
Note that we have to exclude those hexagon grid that does not contain busstops (n_colli=0).
4.3.1.3 Distribution of busstops over Singapore
Understanding the spatial distribution of busstops in Singapore is essential for analyzing the accessibility of its public transport system. Thus, plot the distribution of busstops across Singapore into a thematic map with tmap
. This will provide a visual representation of the number of bus stops across different regions.
Show the code
tmap_mode("plot")
tmap_options(check.and.fix = TRUE)
tm_shape(honeycomb_count) +
tm_polygons() +
tm_shape(bus_count) +
tm_fill("n_colli_busstop",
style = "quantile",
palette = "Blues",
title = "Number of busstops") +
tm_layout(main.title = "Distribution of busstops over Singapore",
main.title.position = "center",
main.title.size = 1.2,
legend.height = 0.45,
legend.width = 0.35,
frame = TRUE) +
tm_borders(alpha = 0.5) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: LTA DataMall",
position = c("left", "bottom"))
4.3.1.4 Append the busstops dataset information to bus_count
The code chunk below populates the BUS_STOP_N of busstop sf data into bus_count sf dataframe.
# Perform the spatial join
<- st_intersection(bus_count, busstop) %>%
bus_hex select(c(BUS_STOP_N, grid_busstop_id)) %>%
st_drop_geometry()
kable(head(bus_hex))
BUS_STOP_N | grid_busstop_id | |
---|---|---|
115 | 22069 | 573 |
112 | 32071 | 560 |
267 | 44331 | 977 |
780 | 96081 | 2129 |
347 | 11561 | 1159 |
558 | 66191 | 1543 |
st_intersection()
is used to perform point and polygon overly and the output will be in point sf object.select()
of dplyr package is then use to retain only BUS_STOP_N and grid_busstop_id in the bus_count sf data frame.
4.3.1.5 Join odbus_5_8 and bus_hex
Next, we have to associate each origin and destination bus stop to their corresponding hexagonal grid. To do so, we shall perform a left_join().
Origin
<- left_join(odbus5_8, bus_hex,
od_data by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
rename(ORIGIN_BUS_STOP_ID = grid_busstop_id)
Destination
<- left_join(od_data, bus_hex,
od_data by = c("DESTINATION_PT_CODE" = "BUS_STOP_N")) %>%
rename(DESTIN_BUS_STOP_ID = grid_busstop_id)
Before continuing, it is a good practice for us to check for duplicating records.
<- od_data %>%
duplicate group_by_all() %>%
filter(n()>1) %>%
ungroup()
There are a total of 2032 duplicate records. To remove the duplicates, the below code chunk will be used to only retain unique records.
<- unique(od_data) od_data
It will be a good practice to confirm if the duplicating records issue has been addressed fully.
Next, we will group by ORIGIN_BUS_STOP_ID and DESTIN_BUS_STOP_ID to aggregate the sum of trips at hexagonal level.
<- od_data %>%
od_data group_by(ORIGIN_BUS_STOP_ID, DESTIN_BUS_STOP_ID) %>%
summarise(EVENING_PEAK = sum(TRIPS)) %>%
drop_na()
kable(head(od_data))
ORIGIN_BUS_STOP_ID | DESTIN_BUS_STOP_ID | EVENING_PEAK |
---|---|---|
23 | 67 | 1 |
23 | 88 | 3 |
23 | 128 | 203 |
23 | 132 | 20 |
23 | 134 | 4 |
23 | 150 | 7 |
#Save the output into a rds file
write_rds(od_data, "data/rds/od_data.rds")
5. Visualising Spatial Interaction
In this section, we will prepare a desire line by using the stplanr
package.
5.1 Removing intra-zonal flows
We will not plot the intra-zonal flows. The code chunk below is used to remove intra-zonal flows.
<- od_data[od_data$ORIGIN_BUS_STOP_ID!=od_data$DESTIN_BUS_STOP_ID,]
od_data1
# Save the output into a rds file
write_rds(od_data1, "data/rds/od_data1.rds")
5.2 Creating desire lines
od2line()
of stplanr package will be used to create the desire lines.
<- od2line(flow = od_data1,
flowLine zones = bus_count,
zone_code = "grid_busstop_id")
5.3 Visualising the desire lines
To visualise the resulting desire lines, the code chunk below is used.
Show the code
tmap_options(check.and.fix = TRUE)
tm_shape(honeycomb_count) + #Base layer
tm_polygons() +
tm_shape(bus_count) +
tm_fill("n_colli_busstop",
style = "quantile",
palette = "Blues",
title = "Number of busstops") +
tm_shape(flowLine) +
tm_lines(lwd = "EVENING_PEAK",
style = "quantile",
col = "red",
scale = c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha = 0.3) +
tm_layout(main.title = "Bus Passenger flow for Weekdays Evening Peak (5pm-8pm)",
main.title.position = "center",
main.title.size = 1.0,
legend.height = 0.35,
legend.width = 0.35,
frame = TRUE) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: LTA DataMall",
position = c("left", "bottom"))
Be patient, the rendering process may take some time because of transparency argument (i.e alpha)
As can be seen from the above flow data, it is very messy and highly skewed, which hinders us from obtaining any useful insights. Thus, it will be wiser to focus on selected flows.
5.3.1 Determining the cut-off value for Evening peak trips
In deciding the appropriate cut-off evening peaks, we could look at the statistics of the trips, using summary()
and quantile()
.
summary(flowLine)
ORIGIN_BUS_STOP_ID DESTIN_BUS_STOP_ID EVENING_PEAK geometry
Min. : 23 Min. : 23 Min. : 1.0 LINESTRING :68246
1st Qu.:1116 1st Qu.:1107 1st Qu.: 8.0 epsg:3414 : 0
Median :1432 Median :1432 Median : 37.0 +proj=tmer...: 0
Mean :1415 Mean :1417 Mean : 354.2
3rd Qu.:1710 3rd Qu.:1727 3rd Qu.: 163.0
Max. :2505 Max. :2505 Max. :132650.0
quantile(flowLine$EVENING_PEAK)
0% 25% 50% 75% 100%
1 8 37 163 132650
Based on the summary, the median flow volume is 37 and the highest is 132,650, which is a drastic difference. The quantile results, also showed that there is huge difference in trips between the different quantiles.
To better visualise the commuting flow, we shall look only at the at the top 5% and 1% of the commuting flows.
$EVENING_PEAK %>% quantile(probs = c(0.95, 0.90)) flowLine
95% 90%
1372.75 619.00
Show the code
tm_shape(honeycomb_count) +
tm_polygons() +
tm_shape(bus_count) +
tm_fill("n_colli_busstop",
style = "quantile",
palette = "Blues",
title = "Number of busstops") +
tm_shape(flowLine %>%
filter(EVENING_PEAK > 619)) +
tm_lines(lwd = "EVENING_PEAK",
style = "quantile",
col = "red",
scale = c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha = 0.3) +
tm_layout(main.title = "Bus Passenger flow for Weekdays Evening Peak (5pm-8pm) \n Desire lines > 619",
main.title.position = "center",
main.title.size = 1.0,
legend.height = 0.35,
legend.width = 0.35,
frame = TRUE) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: LTA DataMall",
position = c("left", "bottom"))
Show the code
tm_shape(honeycomb_count) +
tm_polygons() +
tm_shape(bus_count) +
tm_fill("n_colli_busstop",
style = "quantile",
palette = "Blues",
title = "Number of busstops") +
tm_shape(flowLine %>%
filter(EVENING_PEAK > 1372)) +
tm_lines(lwd = "EVENING_PEAK",
style = "quantile",
col = "red",
scale = c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha = 0.3) +
tm_layout(main.title = "Bus Passenger flow for Weekdays Evening Peak (5pm-8pm) \n Desire lines > 1372",
main.title.position = "center",
main.title.size = 1.0,
legend.height = 0.35,
legend.width = 0.35,
frame = TRUE) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: LTA DataMall",
position = c("left", "bottom"))
From the above plot, we can see the thick flow lines around the Woodlands area (i.e Woodland Checkpoint). This is probably because the bus stops may be near the bus and/or train interchanges. We could also see that the trips tend to be between North and East region, probably due to convenience of taking the bus instead of MRT.
Boon Lay area also seems to have high bus commute flow, which was rather surprising as it is not in the city area, this could also be due nearby bus or train station. There also seems to be a high traffic flow around the North-East region (i.e Ang Mo Kio), this could be residential area.
The flow lines seems to be sparse towards the periphery of the island and at the West region near Tuas, probably due to few bus stops available at that area.
We can also observe that there are several intersection at certain points in the map, this could be bus-interchanges, with multiple buses flowing in and out of the interchange.
6. Spatial Interaction Model
Spatial Interaction Models (SIMs) are mathematical models for estimating flows between spatial entities developed by Alan Wilson in the late 1960s and early 1970, with considerable uptake and refinement for transport modelling since then Boyce and Williams (2015).
There are four main types of traditional SIMs (Wilson 1971):
Unconstrained
Production-constrained
Attraction-constrained
Doubly-constrained
6.1 Computing Distance Matrix
In spatial interaction, a distance matrix is a table that shows the distance between pairs of location.
In this section, we will compute a distance matrix by using the Busstops at the analytical hexagon level.
6.1.1 Converting from sf data.table to Spatial Polygons Dataframe
There are several ways to compute the required distance matrix, one is based on sf and the other is based on sp. Past experience showed that computing distance matrix using sf function took relatively longer time than sp method, especially with large data set. Thus, we shall use sp method for the computation as shown in the code chunk below.
<- as(bus_count, "Spatial")
bus_sp bus_sp
class : SpatialPolygonsDataFrame
features : 834
extent : 3595.122, 48595.12, 26049.09, 53545.39 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
variables : 2
names : grid_busstop_id, n_colli_busstop
min values : 23, 1
max values : 2505, 19
as.Spatial()
is used to convert mpsz from sf tibble data frame to Spatial Polygons Dataframe of sp object.
6.1.2 Computing the distance matrix
Next, spDists()
of sp package will be used to compute the Euclidean distance between the centroids of the hexagonal grid.
<- spDists(bus_sp,
dist longlat = FALSE)
head(dist, n=c(10, 10))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.000 750.000 3269.174 1500.000 2704.163 3968.627 1299.038 2250.000
[2,] 750.000 0.000 2598.076 750.000 1984.313 3269.174 750.000 1500.000
[3,] 3269.174 2598.076 0.000 1984.313 750.000 750.000 2704.163 1500.000
[4,] 1500.000 750.000 1984.313 0.000 1299.038 2598.076 750.000 750.000
[5,] 2704.163 1984.313 750.000 1299.038 0.000 1299.038 1984.313 750.000
[6,] 3968.627 3269.174 750.000 2598.076 1299.038 0.000 3269.174 1984.313
[7,] 1299.038 750.000 2704.163 750.000 1984.313 3269.174 0.000 1299.038
[8,] 2250.000 1500.000 1500.000 750.000 750.000 1984.313 1299.038 0.000
[9,] 3436.932 2704.163 750.000 1984.313 750.000 750.000 2598.076 1299.038
[10,] 4683.748 3968.627 1500.000 3269.174 1984.313 750.000 3897.114 2598.076
[,9] [,10]
[1,] 3436.932 4683.748
[2,] 2704.163 3968.627
[3,] 750.000 1500.000
[4,] 1984.313 3269.174
[5,] 750.000 1984.313
[6,] 750.000 750.000
[7,] 2598.076 3897.114
[8,] 1299.038 2598.076
[9,] 0.000 1299.038
[10,] 1299.038 0.000
Notice that the output dist is a matrix object class of R. Also, notice that the column and row headers are not labelled with the grid_bustop_id.
6.1.3 Labelling column and row headers of the distance matrix
First, we will create a list sorted according to the distance matrix by grid_busstop_id.
<- bus_sp$grid_busstop_id grid_id_names
Next, we will attach grid_id_names to row and column headers for matching the distance matrix.
colnames(dist) <- paste0(grid_id_names)
rownames(dist) <- paste0(grid_id_names)
6.1.4 Pivoting distance_value by grid_busstop_id
Next, we will pivot the distance matrix into a long table by using the row and column grid_id_names as shown in the code chunk below.
<- melt(dist) %>%
distPair rename(dist = value)
head(distPair, 10)
Var1 Var2 dist
1 23 23 0.000
2 44 23 750.000
3 46 23 3269.174
4 66 23 1500.000
5 67 23 2704.163
6 68 23 3968.627
7 86 23 1299.038
8 87 23 2250.000
9 88 23 3436.932
10 89 23 4683.748
Notice that the within zone distance is 0.
6.1.5 Updating intra-zonal distances
In this section, we will append a constant value to replace intra-zonal distance of 0.
First, we will select and find out the minimum value of the distance by using summary()
.
%>%
distPair filter(dist > 0) %>%
summary()
Var1 Var2 dist
Min. : 23 Min. : 23 Min. : 750
1st Qu.: 871 1st Qu.: 871 1st Qu.: 8352
Median :1324 Median :1324 Median :13332
Mean :1269 Mean :1269 Mean :14162
3rd Qu.:1688 3rd Qu.:1688 3rd Qu.:18929
Max. :2505 Max. :2505 Max. :44680
Based on the summary, we can see that the minimum distance is 750m (distance between the centriod of a hexagon to the centroid of an adjacent hexagon), thus we shall set the intra-zonal distance to a constant distance value of 300m.
$dist <- ifelse(distPair$dist == 0,
distPair300, distPair$dist)
The code chunk below will be used to check the result data.frame.
%>%
distPair summary()
Var1 Var2 dist
Min. : 23 Min. : 23 Min. : 300
1st Qu.: 871 1st Qu.: 871 1st Qu.: 8250
Median :1324 Median :1324 Median :13332
Mean :1269 Mean :1269 Mean :14145
3rd Qu.:1688 3rd Qu.:1688 3rd Qu.:18929
Max. :2505 Max. :2505 Max. :44680
The code chunk below is used to rename the origin and destination fields.
<- distPair %>%
distPair rename(orig_grid_id = Var1,
destin_grid_id = Var2)
Lastly, the code chunk below is used to save the dataframe for future use.
write_rds(distPair, "data/rds/distPair.rds")
6.2 Preparing the inter zonal flow data
The code chunk below is used to run the code chunk that was saved earlier.
<- read_rds("data/rds/od_data1.rds") flow_data
Use the code chunk below to display the flow_data dataframe.
head(flow_data, 10)
# A tibble: 10 × 3
# Groups: ORIGIN_BUS_STOP_ID [2]
ORIGIN_BUS_STOP_ID DESTIN_BUS_STOP_ID EVENING_PEAK
<int> <int> <dbl>
1 23 67 1
2 23 88 3
3 23 128 203
4 23 132 20
5 23 134 4
6 23 150 7
7 23 154 1
8 23 175 178
9 44 128 2
10 44 134 20
Two new columns called FlowNoIntra and offset will be created by using the code chunk below.
$FlowNoIntra <- ifelse(
flow_data$ORIGIN_BUS_STOP_ID == flow_data$DESTIN_BUS_STOP_ID,
flow_data0, flow_data$EVENING_PEAK)
$offset <- ifelse(
flow_data$ORIGIN_BUS_STOP_ID == flow_data$DESTIN_BUS_STOP_ID,
flow_data0.000001, 1)
According to the syntax used to derive values in FlowNoIntra field, all intra-zonal flow will be given a value of 0 or else the original flow values will be inserted.
Next, inter-zonal flow will be selected from flow_data and save into a new output data.frame called inter_zonal_flow by using the code chunk below.
<- flow_data %>%
inter_zonal_flow filter(FlowNoIntra > 0)
6.2.1 Combining passenger volume data with distance value
Before we can join inter_zonal_flow and distPair, we need to convert data value type of ORIGIN_BUS_STOP_ID and DESTIN_BUS_STOP_ID fields of inter_zonal_flow dataframe into factor data type.
$ORIGIN_BUS_STOP_ID <- as.factor(inter_zonal_flow$ORIGIN_BUS_STOP_ID)
inter_zonal_flow$DESTIN_BUS_STOP_ID <- as.factor(inter_zonal_flow$DESTIN_BUS_STOP_ID) inter_zonal_flow
$orig_grid_id <- as.factor(distPair$orig_grid_id)
distPair$destin_grid_id <- as.factor(distPair$destin_grid_id) distPair
Next, left_join()
of dplyr will be used to join inter_zonal_flow dataframe and distPair dataframe. The output is called flow_data1.
<- inter_zonal_flow %>%
flow_data1 left_join (distPair,
by = c("ORIGIN_BUS_STOP_ID" = "orig_grid_id",
"DESTIN_BUS_STOP_ID" = "destin_grid_id")) %>%
rename(TRIPS = EVENING_PEAK,
DIST = dist)
6.3 Preparing Origin and Destination Attributes
To determine the factors affecting the urban commuting flows, we have to prepare the below propulsiveness and attractiveness variables required for calibrating the spatial interaction models.
Propulsive attribute: factors that influence the movement from one location to another at the origin.
Attractiveness attribute: factors that attract the passengers towards a specific destination.
6.3.1 Origin Attributes
1. Business
<- st_read(dsn = "data/geospatial",
business_sf layer = "Business") %>%
st_transform(crs = 3414)
Reading layer `Business' from data source
`C:\Users\Dabbie Neo\Desktop\Geospatial_Analytics\dneoo\ISSS624\Take-Home_Ex\Take-Home_Ex02\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 6550 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3669.148 ymin: 25408.41 xmax: 47034.83 ymax: 50148.54
Projected CRS: SVY21 / Singapore TM
Count the number of businesses inside each hexagonal grid
$`biz_count`<- lengths(
bus_countst_intersects(
bus_count, business_sf))
summary(bus_count$biz_count)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 1.000 7.273 7.000 97.000
The summary statistics above shows that there are hexagonal with no businesses, with most having only one business in it and the maximum is 97.
2. Financial Services
<- st_read(dsn = "data/geospatial",
finServ_sf layer = "FinServ") %>%
st_transform(crs = 3414)
Reading layer `FinServ' from data source
`C:\Users\Dabbie Neo\Desktop\Geospatial_Analytics\dneoo\ISSS624\Take-Home_Ex\Take-Home_Ex02\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 3320 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 4881.527 ymin: 25171.88 xmax: 46526.16 ymax: 49338.02
Projected CRS: SVY21 / Singapore TM
Count the number of financial services in each hexagonal grid
$`finserv_count`<- lengths(
bus_countst_intersects(
bus_count, finServ_sf))
summary(bus_count$finserv_count)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 1.000 3.922 4.000 139.000
The summary statistics above shows that there are hexagonal with no financial services, with most having only one financial services in it and the maximum is 139.
Next, plot the businesses and financial services on the hexagonal grid.
Show the code
tmap_options(check.and.fix = TRUE)
tm_shape(honeycomb_count) +
tm_polygons() +
tm_shape(bus_count) +
tm_fill("n_colli_busstop",
style = "quantile",
palette = "Blues",
title = "Number of busstops") +
tm_shape(business_sf) +
tm_dots(size=0.03, col ="brown") +
tm_shape(finServ_sf) +
tm_dots(size=0.03, col ="purple") +
tm_layout(main.title = "Business and Financial Services Locations",
main.title.position = "center",
main.title.size = 1.0,
legend.height = 0.35,
legend.width = 0.35,
frame = TRUE) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: LTA DataMall",
position = c("left", "bottom"))
Show the code
tmap_options(check.and.fix = TRUE)
tm_shape(honeycomb_count) +
tm_polygons() +
tm_shape(bus_count %>% filter(if_any(biz_count, ~. >= 1))) +
tm_fill("biz_count",
style = "quantile",
#palette = "YIOrBr",
title = "Number of Businsses") +
tm_shape(flowLine %>%
filter(EVENING_PEAK > 1372)) +
tm_lines(lwd = "EVENING_PEAK",
style = "quantile",
col = "red",
scale = c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha = 0.3) +
tm_layout(main.title = "Bus commuting flow at Business locations",
main.title.position = "center",
main.title.size = 1.0,
legend.height = 0.35,
legend.width = 0.35,
frame = TRUE) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: LTA DataMall",
position = c("left", "bottom"))
Show the code
tmap_options(check.and.fix = TRUE)
tm_shape(honeycomb_count) +
tm_polygons() +
tm_shape(bus_count %>% filter(if_any(finserv_count, ~. >= 1))) +
tm_fill("finserv_count",
style = "quantile",
palette = "Purples",
title = "Number of Financial Services") +
tm_shape(flowLine %>%
filter(EVENING_PEAK > 1372)) +
tm_lines(lwd = "EVENING_PEAK",
style = "quantile",
col = "red",
scale = c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha = 0.3) +
tm_layout(main.title = "Bus commuting flow at Financial Service locations",
main.title.position = "center",
main.title.size = 1.0,
legend.height = 0.35,
legend.width = 0.35,
frame = TRUE) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: LTA DataMall",
position = c("left", "bottom"))
Based on the plot above, we can see that there is a number of business that are located at the West and Central region. As for the financial service locations, it is mostly located at the central business district area and we can see there bus commute flow is heavy at area with higher number of financial services.
3. Schools
<- read_csv("data/aspatial/schools.csv") %>%
schools rename(latitude = results.LATITUDE,
longitude = results.LONGITUDE) %>%
select(postal_code, school_name, latitude, longitude)
<- st_as_sf(schools,
schools_sf coords = c("longitude", "latitude"),
crs=4326) %>%
st_transform(crs = 3414)
Count the number of schools available in each hexagonal grid
$`sch_count`<- lengths(
bus_countst_intersects(
bus_count, schools_sf))
summary(bus_count$sch_count)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.4125 1.0000 4.0000
The summary statistics above reveals that there are excessive 0 values in sch_count field. If log()
is going to be used to transform this field, additional step is required to ensure that all 0 will be replaced with a value between 0 and 1, but not including 0 and 1.
Next, plot the schools on the hexagonal grid.
Show the code
tmap_options(check.and.fix = TRUE)
tm_shape(honeycomb_count) +
tm_polygons() +
tm_shape(bus_count) +
tm_fill("n_colli_busstop",
style = "quantile",
palette = "Blues",
title = "Number of busstops")+
tm_shape(schools_sf) +
tm_dots(size=0.05, col ="yellow") +
tm_layout(main.title = "Schools Locations",
main.title.position = "center",
main.title.size = 1.0,
legend.height = 0.35,
legend.width = 0.35,
frame = TRUE) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: LTA DataMall",
position = c("left", "bottom"))
6.3.2 Destination Attractiveness
1. MRT Locations
The train station exits could be a destination attractiveness as bus passengers could take the MRT as part of their journey, and or the MRT stations could be near bus-interchanges where passengers may take the feeder bus home.
<- st_read(dsn = "data/geospatial",
train_exit_sf layer = "Train_Station_Exit_Layer") %>%
st_transform(crs = 3414)
Reading layer `Train_Station_Exit_Layer' from data source
`C:\Users\Dabbie Neo\Desktop\Geospatial_Analytics\dneoo\ISSS624\Take-Home_Ex\Take-Home_Ex02\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 565 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21
Check if the MRT Station Exit dataset contains duplicates.
<- train_exit_sf %>%
duplicate_mrt group_by(stn_name, exit_code) %>%
filter(n()>1) %>%
ungroup()
There are a total of 17 duplicated MRT Station Exit records. The code chunk below is used to retain only the unique records.
<- unique(train_exit_sf) train_exit_sf
Count the number of MRT Station Exits at each hexagonal grid
$`MRT_COUNT`<- lengths(
bus_countst_intersects(
bus_count, train_exit_sf))
It is always a good practice to examine the summary statistics of the derived variable.
summary(bus_count$MRT_COUNT)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.6715 0.0000 13.0000
The summary statistics above reveals that there are excessive 0 values in MRT_COUNT field. If log()
is going to be used to transform this field, additional step is required to ensure that all 0 will be replaced with a value between 0 and 1, but not including 0 and 1.
Show the code
tmap_options(check.and.fix = TRUE)
tm_shape(honeycomb_count) +
tm_polygons() +
tm_shape(bus_count) +
tm_fill("n_colli_busstop",
style = "quantile",
palette = "Blues",
title = "Number of busstops")+
tm_shape(train_exit_sf) +
tm_dots(size=0.05, col ="red") +
tm_layout(main.title = "Train Staion Exit Locations",
main.title.position = "center",
main.title.size = 1.0,
legend.height = 0.35,
legend.width = 0.35,
frame = TRUE) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: LTA DataMall",
position = c("left", "bottom"))
2. Leisure activities (i.e F&B, Entertainment, Leisure&Recreation, Retails)
(a) F&B
<- st_read(dsn = "data/geospatial",
fnb_sf layer = "F&B") %>%
st_transform(crs = 3414)
Reading layer `F&B' from data source
`C:\Users\Dabbie Neo\Desktop\Geospatial_Analytics\dneoo\ISSS624\Take-Home_Ex\Take-Home_Ex02\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 1919 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6010.495 ymin: 25343.27 xmax: 45462.43 ymax: 48796.21
Projected CRS: SVY21 / Singapore TM
Count the number of F&B in each hexagonal grid
$`fnb_count`<- lengths(
bus_countst_intersects(
bus_count, fnb_sf))
summary(bus_count$fnb_count)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 2.204 1.000 133.000
(b) Leisure & Recreation
<- st_read(dsn = "data/geospatial",
leisure_sf layer = "Liesure&Recreation") %>%
st_transform(crs = 3414)
Reading layer `Liesure&Recreation' from data source
`C:\Users\Dabbie Neo\Desktop\Geospatial_Analytics\dneoo\ISSS624\Take-Home_Ex\Take-Home_Ex02\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 1217 features and 30 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6010.495 ymin: 25134.28 xmax: 48439.77 ymax: 50078.88
Projected CRS: SVY21 / Singapore TM
Count the number of Leisure & Recreation in each hexagonal grid
$`leisure_count`<- lengths(
bus_countst_intersects(
bus_count, leisure_sf))
summary(bus_count$leisure_count)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 1.307 1.000 41.000
(c) Entertainment
<- st_read(dsn = "data/geospatial",
entertn_sf layer = "entertn") %>%
st_transform(crs = 3414)
Reading layer `entertn' from data source
`C:\Users\Dabbie Neo\Desktop\Geospatial_Analytics\dneoo\ISSS624\Take-Home_Ex\Take-Home_Ex02\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 114 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 10809.34 ymin: 26528.63 xmax: 41600.62 ymax: 46375.77
Projected CRS: SVY21 / Singapore TM
Count number of Entertainment businesses in each hexagonal grid
$`entertn_count`<- lengths(
bus_countst_intersects(
bus_count, entertn_sf))
summary(bus_count$entertn_count)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.1331 0.0000 9.0000
(d) Retails
<- st_read(dsn = "data/geospatial",
retail_sf layer = "Retails") %>%
st_transform(crs = 3414)
Reading layer `Retails' from data source
`C:\Users\Dabbie Neo\Desktop\Geospatial_Analytics\dneoo\ISSS624\Take-Home_Ex\Take-Home_Ex02\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 37635 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 4737.982 ymin: 25171.88 xmax: 48265.04 ymax: 50135.28
Projected CRS: SVY21 / Singapore TM
Count the number of Retails in each hexagonal grid
$`retail_count`<- lengths(
bus_countst_intersects(
bus_count, retail_sf))
summary(bus_count$retail_count)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 8.00 44.04 36.75 1678.00
Next, plot all the leisure activities on the hexagonal grids.
Show the code
tmap_options(check.and.fix = TRUE)
tm_shape(honeycomb_count) +
tm_polygons() +
tm_shape(bus_count) +
tm_fill("n_colli_busstop",
style = "quantile",
palette = "Blues",
title = "Number of Busstops") +
tm_shape(fnb_sf) +
tm_dots(size=0.06, col = "black") +
tm_shape(leisure_sf) +
tm_dots(size=0.06, col = "darkgreen") +
tm_shape(entertn_sf) +
tm_dots(size=0.06, col = "darkred") +
tm_shape(retail_sf) +
tm_dots(size=0.01, col = "orange") +
tm_layout(main.title = "Locations of Leisure activites",
main.title.position = "center",
main.title.size = 1.0,
legend.height = 0.35,
legend.width = 0.35,
frame = TRUE) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: LTA DataMall",
position = c("left", "bottom"))
Based on the map, we can see a large number of retails located at the central area (orange dots) and leisure (green dots) and F&B (black dots) locations seems to be scattered around different areas . Only a few entertainment locations can be spotted on the map (red dots).
3. HDB
We can count the number of HDB blocks in each hexagonal, as well as the total dwelling units. This provides us a population proxy in that each area.
<- read_csv("data/aspatial/hdb.csv") hdb_pop
<- hdb_pop %>%
hdb_filtered_sf filter(residential == "Y") %>%
select(blk_no, street, total_dwelling_units, lat, lng)
<- st_as_sf(hdb_filtered_sf,
hdb_sf coords = c('lng','lat'),
crs=4326) %>%
st_transform(crs=3414)
hdb_sf
Simple feature collection with 10181 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11519.15 ymin: 28097.64 xmax: 45224.46 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM
# A tibble: 10,181 × 4
blk_no street total_dwelling_units geometry
* <chr> <chr> <dbl> <POINT [m]>
1 1 BEACH RD 142 (30309.27 30830.82)
2 1 BEDOK STH AVE 1 206 (39173.81 33678.86)
3 1 CHAI CHEE RD 102 (37949.03 34465.74)
4 1 CHANGI VILLAGE RD 55 (45224.46 41171.52)
5 1 DELTA AVE 96 (27473.09 30496.64)
6 1 DOVER RD 125 (22430.23 31652.72)
7 1 EUNOS CRES 247 (35695.14 33657.24)
8 1 EVERTON PK 95 (28899.23 28663.72)
9 1 GHIM MOH RD 220 (22829.05 32797.14)
10 1 HAIG RD 219 (35165.43 32621.5)
# ℹ 10,171 more rows
Count the number of HDBs in each hexagonal grid
$`hdb_count`<- lengths(
bus_countst_intersects(
bus_count, hdb_sf))
summary(bus_count$hdb_count)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 0.00 12.15 22.00 82.00
Show the code
tmap_options(check.and.fix = TRUE)
tm_shape(honeycomb_count) +
tm_polygons() +
tm_shape(bus_count %>% filter(if_any(hdb_count, ~. > 0))) +
tm_fill('hdb_count',
alpha= 0.7) +
tm_shape(flowLine %>%
filter(EVENING_PEAK > 1372)) +
tm_lines(lwd = "EVENING_PEAK",
style = "quantile",
col = "red",
scale = c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha = 0.3) +
tm_layout(main.title = "HDB Population density (by HDB Count)",
main.title.position = "center",
main.title.size = 1.0,
legend.height = 0.35,
legend.width = 0.35,
frame = TRUE) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: LTA DataMall",
position = c("left", "bottom"))
Total dwelling units in each hexagonal grid
Though hdb_count could be used as a proxy for population, we however need to note that different HDBs have different sizes, some may have more units, while other may have less units. Thus, we will also look at the total number of dwelling units at each hexagonal grid.
# Perform the spatial join
<- st_join(bus_count, hdb_sf, join = st_intersects) hdb_hex
<- hdb_hex %>%
hdb_hex group_by(grid_busstop_id) %>%
summarise(Total_dwelling_units = sum(total_dwelling_units)) %>%
drop_na()
<- hdb_hex %>%
hdb_hex1 st_drop_geometry()
<- bus_count %>%
bus_count left_join(hdb_hex1 %>%
select(grid_busstop_id, Total_dwelling_units),
by = "grid_busstop_id")
<- bus_count%>%
bus_count mutate(Total_dwelling_units = replace_na(Total_dwelling_units, 0))
Show the code
tmap_options(check.and.fix = TRUE)
tm_shape(honeycomb_count) +
tm_polygons() +
tm_shape(hdb_hex %>% filter(if_any(Total_dwelling_units, ~. > 0))) +
tm_fill('Total_dwelling_units',
alpha= 0.7) +
tm_shape(flowLine %>%
filter(EVENING_PEAK > 1372)) +
tm_lines(lwd = "EVENING_PEAK",
style = "quantile",
col = "red",
scale = c(0.1, 1, 3, 5, 7, 10),
n = 6,
alpha = 0.3) +
tm_layout(main.title = "HDB Population density (By Total dwelling units)",
main.title.position = "center",
main.title.size = 1.0,
legend.height = 0.35,
legend.width = 0.35,
frame = TRUE) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: LTA DataMall",
position = c("left", "bottom"))
Looking at both HDB population density plots (HDB count and Total dwelling units) above, we can see that the East and North-East regions have a higher population density. Also, we can see that the bus commute flow seems to correspond with where the residential areas are at. This could be because bus stops are nearer to these residential areas as compared to MRT stations.
6.4 Append all attributes to flow_data
Before appending the attributes data to flow_data, change the grid_busstop_id to factor data type.
$grid_busstop_id <- as.factor(bus_count$grid_busstop_id) bus_count
Next, we are going to append all the Origin attributes from bus_count dataframe onto flow_data1 dataframe and rename the it as flow_data2.
<- flow_data1 %>%
flow_data2 left_join(bus_count,
by = c(ORIGIN_BUS_STOP_ID = 'grid_busstop_id')) %>%
select(1:6, 8:10) %>%
rename(ORIGIN_biz_count = biz_count,
ORIGIN_finserv_count = finserv_count,
ORIGIN_school_count = sch_count)
Below code chunk is used to append all the Destination attributes from bus_count dataframe onto flow_data2 dataframe.
<- flow_data2 %>%
flow_data2 left_join(bus_count,
by = c(DESTIN_BUS_STOP_ID = 'grid_busstop_id')) %>%
rename(DESTIN_MRT_COUNT = MRT_COUNT,
DESTIN_fnb_count = fnb_count,
DESTIN_leisure_count = leisure_count,
DESTIN_entertn_count = entertn_count,
DESTIN_retail_count = retail_count,
DESTIN_hdb_count = hdb_count,
DESTIN_total_dwelling_units = Total_dwelling_units) %>%
select(-c( n_colli_busstop,area_bus_grid,biz_count,finserv_count,sch_count))
Next, save the output data file flow_data2 in rds data file format.
write_rds(flow_data2 , "data/rds/flow_data2.rds")
6.5 Calibrating Spatial Interaction Models
In this section, we will calibrate the Spatial Interaction Models using Poisson Regression method.
6.5.1 Import the modelling data
Firstly, let’s import the modelling data by using the code chunk below.
<- read_rds("data/rds/flow_data2.rds") flow_data2
6.5.2 Visualising the dependent variable
Next, let’s plot the distribution of the dependent variable (i.e. TRIPS) using the code chunk below.
Show the code
ggplot(data = flow_data2,
aes(x = TRIPS)) +
geom_histogram()
Notice that the distribution is highly skewed and not resemble bell shape (also known as normal distribution).
Next, let’s visualise the relation between the dependent variable and one of the key independent variable in Spatial Interaction Model, namely distance.
Show the code
ggplot(data = flow_data2,
aes(x = DIST,
y = TRIPS)) +
geom_point() +
geom_smooth(method = lm)
Notice that the relationship hardly resemble linear relationship.
On the other hand, if we plot the scatter plot by using the transformed version of both variables, we can see that their relationship resembles a more linear relationship.
Show the code
ggplot(data = flow_data2,
aes(x = log(DIST),
y = log(TRIPS))) +
geom_point() +
geom_smooth(method = lm)
6.5.3 Checking for variables with zero values
Since Poisson Regression is based of log and log 0 is undefined, it is important for us to ensure that no 0 values in the explanatory variables.
In the code chunk below, summary(
) of Base R is used to compute the summary statistics of all variables in flow_data2 data frame.
summary(flow_data2)
ORIGIN_BUS_STOP_ID DESTIN_BUS_STOP_ID TRIPS FlowNoIntra
1474 : 365 1474 : 327 Min. : 1.0 Min. : 1.0
1452 : 345 1516 : 308 1st Qu.: 8.0 1st Qu.: 8.0
1960 : 303 1960 : 303 Median : 37.0 Median : 37.0
1496 : 299 1477 : 296 Mean : 354.2 Mean : 354.2
1432 : 294 1496 : 294 3rd Qu.: 163.0 3rd Qu.: 163.0
1477 : 292 1749 : 281 Max. :132650.0 Max. :132650.0
(Other):66348 (Other):66437
offset DIST ORIGIN_biz_count ORIGIN_finserv_count
Min. :1 Min. : 750 Min. : 0.000 Min. : 0.000
1st Qu.:1 1st Qu.: 3269 1st Qu.: 0.000 1st Qu.: 1.000
Median :1 Median : 5662 Median : 1.000 Median : 3.000
Mean :1 Mean : 6501 Mean : 7.053 Mean : 8.469
3rd Qu.:1 3rd Qu.: 9124 3rd Qu.: 6.000 3rd Qu.: 9.000
Max. :1 Max. :24784 Max. :97.000 Max. :139.000
ORIGIN_school_count DESTIN_MRT_COUNT DESTIN_fnb_count DESTIN_leisure_count
Min. :0.0000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
Median :0.0000 Median : 0.000 Median : 0.000 Median : 1.000
Mean :0.5327 Mean : 1.293 Mean : 5.234 Mean : 2.217
3rd Qu.:1.0000 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 3.000
Max. :4.0000 Max. :13.000 Max. :133.000 Max. :41.000
DESTIN_entertn_count DESTIN_retail_count DESTIN_hdb_count
Min. :0.0000 Min. : 0.00 Min. : 0.00
1st Qu.:0.0000 1st Qu.: 7.00 1st Qu.: 0.00
Median :0.0000 Median : 28.00 Median : 8.00
Mean :0.3133 Mean : 91.09 Mean :16.62
3rd Qu.:0.0000 3rd Qu.: 98.00 3rd Qu.:29.00
Max. :9.0000 Max. :1678.00 Max. :82.00
DESTIN_total_dwelling_units
Min. : 0
1st Qu.: 0
Median : 912
Mean :1807
3rd Qu.:3250
Max. :7946
The print report above reveals that variables ORIGIN_biz_count, ORIGIN_finserv_count, ORIGIN_school_count, DESTIN_MRT_COUNT, DESTIN_fnb_count, DESTIN_leisure_count, DESTIN_entertn_count, DESTIN_retail_count, DESTIN_hdb_count and DESTIN_total_dwelling units.
In view of this, the code chunk below will be used to replace 0 values to 0.99.
Show the code
$DESTIN_MRT_COUNT <- ifelse(
flow_data2$DESTIN_MRT_COUNT == 0,
flow_data20.99, flow_data2$DESTIN_MRT_COUNT)
$DESTIN_fnb_count <- ifelse(
flow_data2$DESTIN_fnb_count == 0,
flow_data20.99, flow_data2$DESTIN_fnb_count)
$DESTIN_leisure_count <- ifelse(
flow_data2$DESTIN_leisure_count == 0,
flow_data20.99, flow_data2$DESTIN_leisure_count)
$DESTIN_entertn_count <- ifelse(
flow_data2$DESTIN_entertn_count == 0,
flow_data20.99, flow_data2$DESTIN_entertn_count)
$DESTIN_retail_count <- ifelse(
flow_data2$DESTIN_retail_count == 0,
flow_data20.99, flow_data2$DESTIN_retail_count)
$DESTIN_hdb_count <- ifelse(
flow_data2$DESTIN_hdb_count == 0,
flow_data20.99, flow_data2$DESTIN_hdb_count)
$DESTIN_total_dwelling_units <- ifelse(
flow_data2$DESTIN_total_dwelling_units == 0,
flow_data20.99, flow_data2$DESTIN_total_dwelling_units)
$ORIGIN_biz_count <- ifelse(
flow_data2$ORIGIN_biz_count == 0,
flow_data20.99, flow_data2$ORIGIN_biz_count)
$ORIGIN_finserv_count <- ifelse(
flow_data2$ORIGIN_finserv_count == 0,
flow_data20.99, flow_data2$ORIGIN_finserv_count)
$ORIGIN_school_count <- ifelse(
flow_data2$ORIGIN_school_count == 0,
flow_data20.99, flow_data2$ORIGIN_school_count)
Run the summary()
again to check the variables.
summary(flow_data2)
ORIGIN_BUS_STOP_ID DESTIN_BUS_STOP_ID TRIPS FlowNoIntra
1474 : 365 1474 : 327 Min. : 1.0 Min. : 1.0
1452 : 345 1516 : 308 1st Qu.: 8.0 1st Qu.: 8.0
1960 : 303 1960 : 303 Median : 37.0 Median : 37.0
1496 : 299 1477 : 296 Mean : 354.2 Mean : 354.2
1432 : 294 1496 : 294 3rd Qu.: 163.0 3rd Qu.: 163.0
1477 : 292 1749 : 281 Max. :132650.0 Max. :132650.0
(Other):66348 (Other):66437
offset DIST ORIGIN_biz_count ORIGIN_finserv_count
Min. :1 Min. : 750 Min. : 0.990 Min. : 0.990
1st Qu.:1 1st Qu.: 3269 1st Qu.: 0.990 1st Qu.: 1.000
Median :1 Median : 5662 Median : 1.000 Median : 3.000
Mean :1 Mean : 6501 Mean : 7.484 Mean : 8.707
3rd Qu.:1 3rd Qu.: 9124 3rd Qu.: 6.000 3rd Qu.: 9.000
Max. :1 Max. :24784 Max. :97.000 Max. :139.000
ORIGIN_school_count DESTIN_MRT_COUNT DESTIN_fnb_count DESTIN_leisure_count
Min. :0.990 Min. : 0.990 Min. : 0.990 Min. : 0.990
1st Qu.:0.990 1st Qu.: 0.990 1st Qu.: 0.990 1st Qu.: 0.990
Median :0.990 Median : 0.990 Median : 0.990 Median : 1.000
Mean :1.172 Mean : 1.919 Mean : 5.771 Mean : 2.619
3rd Qu.:1.000 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 3.000
Max. :4.000 Max. :13.000 Max. :133.000 Max. :41.000
DESTIN_entertn_count DESTIN_retail_count DESTIN_hdb_count
Min. :0.990 Min. : 0.99 Min. : 0.99
1st Qu.:0.990 1st Qu.: 7.00 1st Qu.: 0.99
Median :0.990 Median : 28.00 Median : 8.00
Mean :1.167 Mean : 91.16 Mean :16.98
3rd Qu.:0.990 3rd Qu.: 98.00 3rd Qu.:29.00
Max. :9.000 Max. :1678.00 Max. :82.00
DESTIN_total_dwelling_units
Min. : 0.99
1st Qu.: 0.99
Median : 912.00
Mean :1807.34
3rd Qu.:3250.00
Max. :7946.00
Notice that all the 0 values have been replaced by 0.99.
6.5.4 Unconstrained Model
In this section,we will calibrate an unconstrained spatial interaction model by using glm()
of Base Stats. The explanatory variables are origin population by different busineses, schools, financial services, destination population by MRT Station Exits, Leisure Activities and total dwelling units and distance between origin and destination in km (i.e. dist).
The general formula of Unconstrained Spatial Interaction Model
The code chunk used to calibrate to model is shown below:
<- glm(formula = TRIPS ~
uncflow log(ORIGIN_biz_count) +
log(ORIGIN_finserv_count) +
log(ORIGIN_school_count) +
log(DESTIN_MRT_COUNT) +
log(DESTIN_fnb_count) +
log(DESTIN_leisure_count) +
log(DESTIN_entertn_count) +
log(DESTIN_retail_count) +
log(DESTIN_hdb_count) +
log(DESTIN_total_dwelling_units) +
log(DIST),
family = poisson(link = "log"),
data = flow_data2,
na.action = na.exclude)
#save the output in rds file
write_rds(uncflow, "data/rds/uncflow.rds")
Call: glm(formula = TRIPS ~ log(ORIGIN_biz_count) + log(ORIGIN_finserv_count) +
log(ORIGIN_school_count) + log(DESTIN_MRT_COUNT) + log(DESTIN_fnb_count) +
log(DESTIN_leisure_count) + log(DESTIN_entertn_count) + log(DESTIN_retail_count) +
log(DESTIN_hdb_count) + log(DESTIN_total_dwelling_units) +
log(DIST), family = poisson(link = "log"), data = flow_data2,
na.action = na.exclude)
Coefficients:
(Intercept) log(ORIGIN_biz_count)
16.18643 -0.03511
log(ORIGIN_finserv_count) log(ORIGIN_school_count)
0.42101 0.13583
log(DESTIN_MRT_COUNT) log(DESTIN_fnb_count)
0.36511 -0.04208
log(DESTIN_leisure_count) log(DESTIN_entertn_count)
-0.08428 0.02249
log(DESTIN_retail_count) log(DESTIN_hdb_count)
0.03129 0.55035
log(DESTIN_total_dwelling_units) log(DIST)
-0.09394 -1.47462
Degrees of Freedom: 68245 Total (i.e. Null); 68234 Residual
Null Deviance: 96930000
Residual Deviance: 45890000 AIC: 46260000
6.5.5 Origin- (Production-) Constrained Model
Figure below shows the general formula of the origin-constrained model.
Code chunk below shows the calibration of the model by using glm()
of R and flow_data2.
<- glm(formula = TRIPS ~
orcflow +
ORIGIN_BUS_STOP_ID log(DESTIN_MRT_COUNT) +
log(DESTIN_fnb_count) +
log(DESTIN_leisure_count) +
log(DESTIN_entertn_count) +
log(DESTIN_hdb_count) +
log(DESTIN_total_dwelling_units) +
log(DESTIN_retail_count) +
log(DIST) - 1,
family = poisson(link = "log"),
data = flow_data2,
na.action = na.exclude)
#save the output in rds file
write_rds(orcflow, "data/rds/orcflow.rds")
For origin-constrained model, only explanatory variables representing the attractiveness at the destinations will be used.
All the explanatory variables including distance will be log transformed.
ORIGIN_BUS_STOP_ID is used to model 𝜇𝑖 . It must be in categorical data type.
It is important to note that -1 is added in the equation after the distance variable. The -1 serves the purpose of removing the intercept that by default, glm will insert into the model.
# Extract the summary of the model
<- summary(orcflow)
summary_orcflow
# Display the last 10 coefficient results
tail(summary_orcflow$coefficients, 10)
Estimate Std. Error z value
ORIGIN_BUS_STOP_ID2427 17.394118366 0.0085090459 2044.191390
ORIGIN_BUS_STOP_ID2505 17.428262800 0.0257404798 677.076066
log(DESTIN_MRT_COUNT) 0.461752870 0.0003555953 1298.534837
log(DESTIN_fnb_count) -0.002873364 0.0003183601 -9.025515
log(DESTIN_leisure_count) -0.033801045 0.0003724946 -90.742378
log(DESTIN_entertn_count) -0.038056644 0.0008567013 -44.422301
log(DESTIN_hdb_count) 0.432411867 0.0005023495 860.778873
log(DESTIN_total_dwelling_units) -0.039218349 0.0002257349 -173.736288
log(DESTIN_retail_count) 0.063396865 0.0001770021 358.170172
log(DIST) -1.501828723 0.0002604113 -5767.140144
Pr(>|z|)
ORIGIN_BUS_STOP_ID2427 0.000000e+00
ORIGIN_BUS_STOP_ID2505 0.000000e+00
log(DESTIN_MRT_COUNT) 0.000000e+00
log(DESTIN_fnb_count) 1.788521e-19
log(DESTIN_leisure_count) 0.000000e+00
log(DESTIN_entertn_count) 0.000000e+00
log(DESTIN_hdb_count) 0.000000e+00
log(DESTIN_total_dwelling_units) 0.000000e+00
log(DESTIN_retail_count) 0.000000e+00
log(DIST) 0.000000e+00
The model results shows that the number of MRTs, HDB blocks, Retails and F&Bs available have a statistically significant relationship with the number of weekday evening trips. In particular, the number of MRTs and HDB blocks have the highest positive association, at 0.462 and 0.432 respectively. This is unsurprising as the working population or students may have just end their day and they are either taking the bus home or transfer to nearest MRT station or taking a different bus. Conversely, the number of leisure and entertainment available has a negative association with the volume of trips, suggesting that destinations further away are less appealing for travel. This trend might be attributed to the dataset reflecting trips taken early or mid-week where individuals may be less inclined to travel long distances.
6.5.6 Destination Constrained Model
In this section, we will fit a destination constrained SIM by using the code chunk below.
The general formula of Destination Constrained Spatial Interaction Model,
<- glm(formula = TRIPS ~
decflow +
DESTIN_BUS_STOP_ID log(ORIGIN_biz_count) +
log(ORIGIN_finserv_count) +
log(ORIGIN_school_count) +
log(DIST)-1,
family = poisson(link = "log"),
data = flow_data2,
na.action = na.exclude)
#save the output in rds file
write_rds(decflow, "data/rds/decflow.rds")
For destination-constrained model, only explanatory variables representing the propulsiveness at the origins will be used.
All the explanatory variables including distance will be log transformed.
DESTIN_BUS_STOP_ID is used to model αi . It must be in categorical data type.
It is important to note that -1 is added in the equation after the distance variable. The -1 serves the purpose of removing the intercept that by default, glm will insert into the model.
# Extract the summary of the model
<- summary(decflow)
summary_decflow
# Display the last 10 coefficient results
tail(summary_decflow$coefficients, 10)
Estimate Std. Error z value Pr(>|z|)
DESTIN_BUS_STOP_ID2384 16.52166146 0.0162567710 1016.29416 0
DESTIN_BUS_STOP_ID2405 16.04568502 0.0173874126 922.83340 0
DESTIN_BUS_STOP_ID2406 14.29744785 0.0468686059 305.05383 0
DESTIN_BUS_STOP_ID2426 15.93981848 0.0319477321 498.93427 0
DESTIN_BUS_STOP_ID2427 16.94039034 0.0130352960 1299.57849 0
DESTIN_BUS_STOP_ID2505 15.33260264 0.0741558881 206.76177 0
log(ORIGIN_biz_count) 0.01733568 0.0001839409 94.24595 0
log(ORIGIN_finserv_count) 0.49079906 0.0001800543 2725.83865 0
log(ORIGIN_school_count) 0.15051552 0.0006865062 219.24858 0
log(DIST) -1.51858814 0.0002567294 -5915.13078 0
The model results shows that all the origin attributes has a statistically significant relationship with the number of weekday evening peak trips, with the number of financial services having the highest positive association at 0.491. This suggests that there is a stronger tendency for people to commence their journey from work or school locations, which makes sense, as it aligns with the typical end of day routines for most individuals.
6.5.7 Doubly Constrained Model
In this section, we will fit a doubly constrained SIM by using the code chunk below.
The general formula of Doubly Constrained Spatial Interaction Model,
<- glm(formula = TRIPS ~
dbcflow +
ORIGIN_BUS_STOP_ID +
DESTIN_BUS_STOP_ID log(DIST),
family = poisson(link = "log"),
data = flow_data2,
na.action = na.exclude)
#save the output in rds file
write_rds(dbcflow, "data/rds/dbcflow.rds")
It is important to note that there is a slight change of the code chunk. I have removed the -1 which means that an intercept will appear in the model again. This is not because I want an intercept as it makes the origin and destination coefficients harder to interpret, rather the -1 cheat for removing the intercept only works with one factor level but in double-constrained model we have two factor levels, namely: origins and destinations.
6.5.8 Model Comparison
6.5.8.1 Goodness of fit
R-squared function
In statistical modelling, the next question we would like to answer is how well the proportion of variance in the dependent variable (i.e. TRIPS) that can be explained by the explanatory variables.
In order to provide an answer to this question, R-squared statistics will be used. However, R-squared is not an output of glm()
. Hence, we will write a function called CalcRSquared by using the code chunk below.
<- function(observed,estimated){
CalcRSquared <- cor(observed,estimated)
r <- r^2
R2
R2 }
Next, we will compute the R-squared of the four SIMs by using the code chunk below.
CalcRSquared(uncflow$data$TRIPS, uncflow$fitted.values)
[1] 0.2347295
CalcRSquared(orcflow$data$TRIPS, orcflow$fitted.values)
[1] 0.4842719
CalcRSquared(decflow$data$TRIPS, decflow$fitted.values)
[1] 0.3463741
CalcRSquared(dbcflow$data$TRIPS, dbcflow$fitted.values)
[1] 0.6294211
The R-square results of the four SIM shows that doubly constrained spatial interaction model gives the best results (62.9%) as compared to other models. It suggests that 62.9% of the observed variation for the dependent variable (i.e Trips) can be explained by the the model’s inputs/variables.
Root Mean Squared Error (RSME)
Another useful model performance measure for continuous dependent variable is Root Mean Squared Error. In this sub-section, we will use compare_performance()
of performance package.
First of all, let us create a list called model_list by using the code chunk below.
<- list(unconstrained=uncflow,
model_list originConstrained=orcflow,
destinationConstrained=decflow,
doublyConstrained=dbcflow)
Next, we will compute the RMSE of all the models in model_list file by using the code chunk below.
compare_performance(model_list,
metrics = "RMSE")
# Comparison of Model Performance Indices
Name | Model | RMSE
-----------------------------------------
unconstrained | glm | 1524.153
originConstrained | glm | 1251.151
destinationConstrained | glm | 1408.526
doublyConstrained | glm | 1060.575
The print above reveals that doubly constrained SIM is the best model among all the four SIMs, as it has the smallest RMSE value of 1060.575.
6.5.8.2 Visualising fitted values
In this section, we will visualise the observed values and the fitted values.
First, we will extract the fitted values from Unconstrained Model by using the code chunk below.
<- as.data.frame(uncflow$fitted.values) %>%
df round(digits = 0)
Next we will append the fitted values into flow_data2 data frame by using the code chunk below.
<- flow_data2 %>%
flow_data2 cbind(df) %>%
rename(uncTRIPS = "uncflow$fitted.values")
Repeat the same step for Origin Constrained SIM (i.e. orcflow)
<- as.data.frame(orcflow$fitted.values) %>%
df round(digits = 0)
<- flow_data2 %>%
flow_data2 cbind(df) %>%
rename(orcTRIPS = "orcflow$fitted.values")
Repeat the same step for Destination Constrained SIM (i.e. decflow)
<- as.data.frame(decflow$fitted.values) %>%
df round(digits = 0)
<- flow_data2 %>%
flow_data2 cbind(df) %>%
rename(decTRIPS = "decflow$fitted.values")
Repeat the same step for Doubly Constrained SIM (i.e. dbcflow)
<- as.data.frame(dbcflow$fitted.values) %>%
df round(digits = 0)
<- flow_data2 %>%
flow_data2 cbind(df) %>%
rename(dbcTRIPS = "dbcflow$fitted.values")
Next, four scatterplots will be created using geom_point()
and other appropriate functions of ggplot2 package.
Show the code
<- ggplot(data = flow_data2,
unc_p aes(x = uncTRIPS,
y = TRIPS)) +
geom_point() +
geom_smooth(method = lm) +
coord_cartesian(xlim=c(0,150000),
ylim=c(0,150000))
<- ggplot(data = flow_data2,
orc_p aes(x = orcTRIPS,
y = TRIPS)) +
geom_point() +
geom_smooth(method = lm)+
coord_cartesian(xlim=c(0,150000),
ylim=c(0,150000))
<- ggplot(data = flow_data2,
dec_p aes(x = decTRIPS,
y = TRIPS)) +
geom_point() +
geom_smooth(method = lm)+
coord_cartesian(xlim=c(0,150000),
ylim=c(0,150000))
<- ggplot(data = flow_data2,
dbc_p aes(x = dbcTRIPS,
y = TRIPS)) +
geom_point() +
geom_smooth(method = lm)+
coord_cartesian(xlim=c(0,150000),
ylim=c(0,150000))
#put all graphs into a single visual for better comparison
ggarrange(unc_p, orc_p, dec_p, dbc_p,
ncol = 2,
nrow = 2)
The above plot shows that the observed and fitted values in the doubly constrained model has a stronger linear trend as compared to other models.
7. Conclusion
Understanding the spatial patterns and determining the factors affecting public bus passenger flows during the different peak periods is essential for urban planning and transportation management, as they help identify areas that may require additional infrastructure, or policy interventions to balance the transit loads and improve overall accessibility.
For this assignment, Weekday evening peak (5pm - 8pm) spatial interaction flow was analysed. It was noted that Doubly constrained Model provided the best result. From the Origin constrained model, it showed that the number of MRTs, HDB blocks, Retails and F&B available are attractive attributes towards the destinations. From the Destination constrained model, it revealed that the number of businesses, financial services and schools are propulsive attributes at the origin.
To further enhance the quality of analysis, the following could be done,
- Passenger Volume could include the day of the week, as this could potentially influence travel patterns.
- Analysis could be based on a a larger dataset which includes a longer time range, other peak hours, etc.
- The Spatial Interaction Model assumes independence among observations, and this assumption seems heroic for many fundamentally spatial problems. It does not consider the propulsiveness and attractiveness of surrounding areas. Spatial Econometric Interaction Models could thus be performed to address this spatial independence problem.