Take-home Exercise 2: Applied Spatial Interaction Models: A case study of Singapore public bus commuter flows
1. Overview
What motivates city residents to rise early and travel from their homes to work each morning? How does the cancellation of a bus service affect those living along its route? These are among the numerous challenges related to urban mobility that transport providers and city planners must tackle.
Traditionally, to answer such questions, commuter surveys have been the go-to method. However, this approach is expensive, time-intensive, and laborious. Not only that, the data collected often require extensive time to process and analyze, rendering them outdated by the time a report is ready.
With the digitization of city-wide infrastructures like buses, subways, utilities, and roads, the resulting data can serve as a basis for mapping out travel patterns over time and space. This has become increasingly feasible with the widespread adoption of pervasive computing technologies, such as GPS in vehicles and SMART cards by public transport users.
Unfortunately, this explosive growth of geospatially-referenced data has far outpaced planners' ability to utilise and transform the data into insightful information, reducing the return on the investment made to collect and manage these data.
2. Objective
This exercise is motivated by two main reasons.
Firstly, there is the recognition that despite the growing availability of open data for public use, a notable gap remains in applied research demonstrating how these varied data sets can be integrated, analysed, and modelled to support policy-making decisions.
Secondly, there is a general lack of practical research to show how geospatial data science and analysis (GDSA) can be used to support decision-making.
Hence, the aim of this exercise is to conduct a case study demonstrating the potential value of GDSA in integrating publicly available data from multiple sources to build spatial interaction models that identify the factors affecting the urban mobility patterns of public bus transit.
3. The Data
In this exercise, we will analyse data from a variety of sources, as outlined in Sections 3.1 and 3.2.
3.1 Aspatial Data
The August, September and October 2023 Passenger Volume by Origin Destination Bus Stops data sets were downloaded from the LTA DataMall. Please note that an API access application has to be submitted in order to download the data. For the purpose of this assignment, only the August 2023 data set will be used.
filtered_origin: an output saved out from Take-home Exercise 1. It provides the categorised peak hour commuting flows.
HDB: This data set is the geocoded version of September 2021 HDB Property Information data from data.gov.sg. This link provides a useful step-by-step guide.
School: Prepared based on School Directory and Information dataset from data.gov.sg and by geocoding the schools’ location using Singapore Land Authority (SLA) API. Please refer to In-class Exercise 4 for the details.
3.2 Geospatial Data
BusStop dataset was downloaded from the LTA DataMall. It provides information about all the bus stops currently being serviced by buses, including the bus stop code (identifier) and location coordinates.
Master Plan 2019 Subzone Boundary (No Sea) (i.e. `MPSZ-2019`) in ESRI shapefile format was downloaded from data.gov.sg. This data provides the sub-zone boundary of URA Master Plan 2019.
Business, entertn, F&B, FinServ, Leisure&Recreation and Retails are geospatial data sets of the locations of business establishments, entertainment venues, food and beverage outlets, financial centres, leisure and recreation centres, and retail and services stores/outlets that can be used for urban mobility studies. (These data sets are provided by Prof Kam.)
Train Station Exit Layer in ESRI shapefile format was downloaded from LTA DataMall. It provides the point geometry of the train station exits.
4. Getting the Data into the R environment
4.1 Setting the R environment
In the following code chunk, p_load() of the pacman package is used to install and load the following R packages into the R environment:
sf for importing, managing, and processing geospatial data,
tidyverse for performing data science tasks such as importing, wrangling and visualising data,
tmap for creating thematic maps,
sfdep for handling geospatial data,
plotly for plotting interactive graphs,
stplanr for transport planning,
DT for displaying DataTables,
sp for classes and methods for spatial data,
reshape2 for flexibly reshaping data,
ggpubr for creating publication-quality statistical graphics,
units for maintaining unit metadata,
knitr for creating HTML tables, and
performance for computing model comparison metrics such as RMSE.
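pacman::p_load(sf, tidyverse, tmap, sfdep, plotly, stplanr, DT, sp, reshape2, ggpubr, units, knitr, performance)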
4.2 Importing the OD data
Firstly, we will import the August 2023 Passenger Volume by Origin Destination Bus Stops data set downloaded from LTA DataMall by using read_csv() of the readr package.
odbus <- read_csv("data/aspatial/origin_destination_bus_202308.csv")
A quick check of the odbus tibble data frame shows that it contains the following variables:
YEAR_MONTH: Year and Month of data collection in YYYY-MM format
DAY_TYPE: Type of the day (WEEKDAY or WEEKENDS/HOLIDAY)
TIME_PER_HOUR: Hour of the day of the passenger trip, in intervals from 0 to 23 hours
PT_TYPE: Type of public transport. As this is a bus data set, only the value BUS should exist (a quick check of this is shown after the glimpse() output below).
ORIGIN_PT_CODE: ID of Origin bus stop
DESTINATION_PT_CODE: ID of Destination bus stop
TOTAL_TRIPS: Total trips representing passenger volumes
Notice that the values in ORIGIN_PT_CODE and DESTINATION_PT_CODE are in character data type.
glimpse(odbus)
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <chr> "04168", "04168", "80119", "80119", "44069", "4406…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "1722…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
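Since this is a bus-only data set, we can also confirm that BUS is the only value present in PT_TYPE. The check below is a minimal sketch added for verification; it was not part of the original exercise output.
# Confirm that BUS is the only public transport type in the data
unique(odbus$PT_TYPE)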
The code chunk below is used to change ORIGIN_PT_CODE and DESTINATION_PT_CODE to factor data type because we want to use them for further processing, such as georeferencing with the Bus Stop Location data.
odbus$ORIGIN_PT_CODE <- as.factor(odbus$ORIGIN_PT_CODE)
odbus$DESTINATION_PT_CODE <- as.factor(odbus$DESTINATION_PT_CODE)
Notice that both of them are in factor type now.
glimpse(odbus)
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <fct> 04168, 04168, 80119, 80119, 44069, 44069, 20281, 2…
$ DESTINATION_PT_CODE <fct> 10051, 10051, 90079, 90079, 17229, 17229, 20141, 2…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
4.2.1 Extracting the OD study data
4.2.1.1 Choosing the peak hour time interval
In order to decide which peak hour time interval to use for our spatial interaction modelling, we will first examine the distribution of the commuting flows for each peak hour time interval.
The code chunk below is used to load the filtered_origin data set that was saved out from Take-home Exercise 1, which provides the commuting flows for each peak hour time interval.
# Load the filtered_origin file saved from Take-home Exercise 1
filtered_origin <- read_rds("data/rds/filtered_origin.rds")
Next, we group by origin-destination and peak hour period and summarise the total trips.
# Summarise the data to calculate the total trips per origin-destination pair and peak hour period
filtered_origin <- filtered_origin %>%
  group_by(ORIGIN_PT_CODE, DESTINATION_PT_CODE, `Peak hour period`) %>%
  summarise(TRIPS = sum(TOTAL_TRIPS))
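As a quick sanity check (not shown in the original exercise), the aggregated tibble can be previewed to confirm that each row now represents an origin-destination pair within a peak hour period.
# Preview the aggregated origin-destination flows
glimpse(filtered_origin)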
Next, we plot the distribution of the log-transformed trips across the different peak periods. A log transformation is needed because of the extreme outliers observed when the raw trips were plotted in Take-home Exercise 1.
# Create a box plot of log-transformed trips by peak hour period
peak_boxplot_log <- filtered_origin %>%
  ggplot(aes(x = `Peak hour period`, y = log(TRIPS), fill = `Peak hour period`)) + # fill colour based on peak hour period
  geom_boxplot() + # Adding the box plot
  labs(
    title = "Distribution of Log(Trips) across different peak periods",
    x = "Peak hour period",
    y = "Log(Total Trips)"
  ) +
  theme_minimal() + # Using a minimal theme for a cleaner look
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold"),
        axis.title.y = element_text(face = "bold")) # Bold title and axis labels

# Convert the ggplot object to an interactive Plotly object
peak_logtrips <- ggplotly(peak_boxplot_log, tooltip = c("y"))

# Display the interactive plot
peak_logtrips