Urban Form and Walkability, Kisumu, Kenya
Geospatial Data Science in Python
The project explores the relationship between urban form and walkability in Kisumu, Kenya, by analyzing Landsat 8 imagery with k-means clustering and assessing street networks through OSMnx. It leverages satellite imagery, machine learning, and urban mapping to understand the impact of city design on walkability, focusing on street density and access to amenities.
​
Background
​
Kisumu is a port city in western Kenya at 1,131 m (3,711 ft) elevation. The port was founded in 1901 as the main inland terminal of the Uganda Railway and named Port Florence. As Kenya's third-largest city, Kisumu is a microcosm of the country's development trajectory. It grapples with the demands of urbanization while striving to provide for its residents' health, well-being, and rights.
Kisumu's urban footprint today showcases a complex interplay of growth, history, and diverse land uses. The city, evolving from its historical origins, now balances modern development with its cultural and historical heritage. Its expansion reflects a range of functionalities, from residential and commercial spaces to industrial areas, all interwoven into the city's unique character.
​
The utilization of scikit-learn for k-means clustering is employed to categorize Landsat 8 imagery, providing insights into the urban landscapes of Kisumu. Following this, OSMnx is applied to analyze the street network of the city, evaluating factors that contribute to walkability, such as street density and access to amenities. This methodology facilitates an assessment of the correlation between urban structure and pedestrian accessibility in Kisumu.
Potential Data Sources
1. ESRI World Imagery Wayback Satellite Data
2. Google Earth Engine
3. OSM Street Maps Data
4. USGS Earth Explorer
​
The code for this repository is hosted on my GitHub page.
Analysing Urban Walkability
​
Intersection density reflects the compactness and connectivity of a street network, key indicators of walkability. A dense network minimizes unreachable areas, while high connectivity allows for multiple routing options.
​
Initially, the walkable network of the study area was downloaded using OSMnx, forming a graph of edges (paths) and nodes (intersections). To address overcomplexity, like minor discrepancies creating multiple nodes for a single intersection, the graph was simplified. Nodes within five meters were merged, and cul-de-sacs were removed, yielding a representation that more accurately reflects true intersections. The increase in nodes from the original to the simplified graph indicates a more complex network structure post-simplification.
​
Number of nodes:
​
Original graph: 5319
Simplified graph: 6153


View Code


#| echo: true
#| code-fold: true
import osmnx as ox
import matplotlib.pyplot as plt
import geopandas as gpd
import seaborn as sns
import os
import numpy as np
import rasterio as rio
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from skimage.exposure import equalize_adapthist
import matplotlib.pyplot as plt
import pandana
%matplotlib inline
#| echo: true
#| code-fold: true
# Select city and crs
cityname = 'kisumu, kenya'
crs = 32636
#| echo: true
#| code-fold: true
# Get graph by geocoding
graph = ox.graph_from_place(cityname, network_type="walk")
# Project graph
graph = ox.projection.project_graph(graph, to_crs=crs)
#| echo: true
#| code-fold: true
# Simplify to get real intersections only
# (consolidate nodes within a distance from eachother)
graph_simplified = ox.simplification.consolidate_intersections(
# Graph to simplify
graph,
# buffer around each node (project the graph beforehand)
tolerance=5,
# Get result as graph (False to get nodes only as gdf)
rebuild_graph=True,
# no dead ends
dead_ends=False,
# Reconnect (False to get intersections only)
reconnect_edges=True
)
#| echo: true
#| code-fold: true
# everything to gdfs
nodes, edges = ox.graph_to_gdfs(graph)
nodes_s, edges_s = ox.graph_to_gdfs(graph_simplified)
#| echo: false
#| code-fold: true
# Setup plot
fig, ax = plt.subplots(figsize=(20,15))
ax.set_axis_off()
ax.set_aspect('equal')
fig.set_facecolor('black')
# Plot data
edges.plot(
ax=ax,
color='white',
linewidth=0.2
)
# Tight layout
plt.tight_layout()
# Save
plt.savefig('./data/graph_overview.png')
​
#| echo: false
#| code-fold: true
# Select city and crs
cityname = 'Kisumu, Kenya'
crs = 'epsg:32636' # Use the string format for CRS to avoid confusion
# Get graph by geocoding
graph = ox.graph_from_place(cityname, network_type="drive")
# Project graph
graph_projected = ox.project_graph(graph, to_crs=crs)
# Simplify the graph to get real intersections only
graph_simplified = ox.consolidate_intersections(
graph_projected,
tolerance=5, # buffer in meters
rebuild_graph=True,
dead_ends=False,
reconnect_edges=True
)
# Convert to geodataframes
nodes, edges = ox.graph_to_gdfs(graph_simplified)
# Setup plot
fig, ax = plt.subplots(figsize=(20, 15))
ax.set_axis_off()
fig.patch.set_facecolor('black')
# Plot drivable streets
edges.plot(
ax=ax,
color='white',
linewidth=0.2
)
# Tight layout for saving
plt.tight_layout()
# Save the figure
plt.savefig('kisumu_drive_network.png', facecolor=fig.get_facecolor())
# Display the plot
plt.show()
​
#| echo: true
#| code-fold: true
# Print info
print(
'number of nodes:\n\noriginal graph: '+str(len(nodes))
+'\nsimplified graph: '+str(len(nodes_s))
)
The increase in nodes from the original to the simplified graph indicates a more complex network structure post-simplification. Utilizing Matplotlib's hexbin and Seaborn’s KDE plots has allowed for a detailed visualization of spatial density variations, offering a deeper understanding of the network’s intricacies.


View Code


#| echo: false
#| code-fold: true
# Set all text color to white
COLOR = 'white'
plt.rcParams['text.color'] = COLOR
plt.rcParams['axes.labelcolor'] = COLOR
plt.rcParams['xtick.color'] = COLOR
plt.rcParams['ytick.color'] = COLOR
# Setup plot
fig, ax = plt.subplots(figsize=(20,15))
ax.set_axis_off()
ax.set_aspect('equal')
fig.set_facecolor('black')
# Plot as hexbins
hb = ax.hexbin(
x=nodes_s['x'],
y=nodes_s['y'],
gridsize=75,
cmap='viridis',
mincnt=1,
vmax=100,
)
# Colorbar
cb = plt.colorbar(hb, ax=ax, shrink=0.8, ticks=[1, 20, 40, 60, 80, 100])
cb.ax.tick_params(color='none', labelsize=20)
cb.ax.set_yticklabels(['1', '20', '40', '60', '80', '>= 100'])
cb.set_label('Intersections per hexagon', fontsize=20, fontweight='bold')
# Tight layout
plt.tight_layout()
# Save
plt.savefig('./data/intersection_hexbin.png')
​
#| echo: false
#| code-fold: true
# Setup plot
fig, ax = plt.subplots(figsize=(30,30))
ax.set_axis_off()
ax.set_aspect('equal')
fig.set_facecolor('black')
# Plot streets underneath
edges_s.plot(ax=ax, color=[1,1,1], linewidth=0.2, zorder= 0)
# Plot KDE on top
sns.kdeplot(
ax=ax,
data=nodes_s,
x='x', y='y',
hue=None,
fill=True,
cmap='mako',
thresh=0,
levels=10,
alpha=0.5,
zorder=10
)
# Tight layout
plt.tight_layout()
# Save
plt.savefig('./data/intersection_kde.png', bbox_inches='tight')
​
#| echo: true
#| code-fold: true
# Save nodes
nodes.to_file('./data/nodes.gpkg', driver='GPKG')
nodes_s.to_file('./data/nodes_simplified.gpkg', driver='GPKG')
The initial analysis equated dense urban fabric with walkability, but urban space encompasses more than just intersections. For my analysis I used the list of OSM features that indicate sociable places, or so called “third places“. With OSMnx and Pandana, a detailed network was created, positioning the POIs. Here, I used the intricate, original graph for accurate travel time computations. Pandana’s routing analysis calculated walking times from each node to ten proximal POIs, assuming a walking speed of 4.5 km/h and capping at a 15-minute walking radius.
The initial analysis equated dense urban fabric with walkability, but urban space encompasses more than just intersections. For my analysis I used the list of OSM features that indicate sociable places, or so called “third places“. With OSMnx and Pandana, a detailed network was created, positioning the POIs. Here, I used the intricate, original graph for accurate travel time computations. Pandana’s routing analysis calculated walking times from each node to ten proximal POIs, assuming a walking speed of 4.5 km/h and capping at a 15-minute walking radius.

View Code


#| echo: true
#| code-fold: true
# Select city and crs
cityname = 'kisumu, kenya'
crs = 32636
#| echo: true
#| code-fold: true
# Get graph by geocoding
graph = ox.graph_from_place(cityname, network_type="walk")
# Project graph
graph = ox.projection.project_graph(graph, to_crs=crs)
#| echo: true
#| code-fold: true
# Select points of interest based on osm tags
tags = {
'amenity':[
'cafe',
'bar',
'pub',
'restaurant'
],
'shop':[
'bakery',
'convenience',
'supermarket',
'mall',
'department_store',
'clothes',
'fashion',
'shoes'
],
'leisure':[
'fitness_centre'
]
}
# Get amentities from place
pois = ox.geometries.geometries_from_place(cityname, tags=tags)
# Project pois
pois = pois.to_crs(epsg=crs)
#| echo: true
#| code-fold: true
# Max time to walk in minutes (no routing to nodes further than this)
walk_time = 15
# Walking speed
walk_speed = 4.5
# Set a uniform walking speed on every edge
for u, v, data in graph.edges(data=True):
data['speed_kph'] = walk_speed
graph = ox.add_edge_travel_times(graph)
# Extract node/edge GeoDataFrames, retaining only necessary columns (for pandana)
nodes = ox.graph_to_gdfs(graph, edges=False)[['x', 'y']]
edges = ox.graph_to_gdfs(graph, nodes=False).reset_index()[['u', 'v', 'travel_time']]
#| echo: true
#| code-fold: true
# Construct the pandana network model
network = pandana.Network(
node_x=nodes['x'],
node_y=nodes['y'],
edge_from=edges['u'],
edge_to=edges['v'],
edge_weights=edges[['travel_time']]
)
# Extract centroids from the pois' geometries
centroids = pois.centroid
#| echo: true
#| code-fold: true
# Specify a max travel distance for analysis
# Minutes -> seconds
maxdist = walk_time * 60
# Set the pois' locations on the network
network.set_pois(
category='pois',
maxdist=maxdist,
maxitems=10,
x_col=centroids.x,
y_col=centroids.y
)
#| echo: true
#| code-fold: true
# calculate travel time to 10 nearest pois from each node in network
distances = network.nearest_pois(
distance=maxdist,
category='pois',
num_pois=10
)
distances.astype(int).head()
#| echo: false
#| code-fold: true
# Set text parameters
COLOR = 'white'
plt.rcParams['text.color'] = COLOR
plt.rcParams['axes.labelcolor'] = COLOR
plt.rcParams['xtick.color'] = COLOR
plt.rcParams['ytick.color'] = COLOR
# Setup plot
fig, ax = plt.subplots(figsize=(20,15))
ax.set_axis_off()
ax.set_aspect('equal')
fig.set_facecolor((0,0,0))
# Plot distance to nearest POI
sc = ax.scatter(
x=nodes['x'],
y=nodes['y'],
c=distances[1],
s=1,
cmap='viridis_r',
)
# Colorbar
cb = fig.colorbar(sc, ax=ax, shrink=0.8, ticks=[0, 300, 600, 900])
cb.ax.tick_params(color='none', labelsize=20)
cb.ax.set_yticklabels(['0', '5', '10', '>= 15'])
cb.set_label('Walking time to nearest POI (minutes)', fontsize=20, fontweight='bold')
# Remove empty space
plt.tight_layout()
# Save
plt.savefig('./data/walk_access.png')
The initial visualization was dense; clarity was improved using Matplotlib’s hexbins, this time mapping average walking times per hex. Notably, showing only the nearest POI can be misleading—a tight cluster of amenities could visually equate to an area with a single amenity. Adjusting the metric to, say, the 5th or 10th nearest POI, can provide a more balanced view, as seen in the comparative visualizations.

View Code


#| echo: false
#| code-fold: true
# Setup plot
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(30,30), constrained_layout=False)
ax[0][0].set_axis_off()
ax[0][1].set_axis_off()
ax[1][0].set_axis_off()
ax[1][1].set_axis_off()
ax[0][0].set_aspect('equal')
ax[0][1].set_aspect('equal')
ax[1][0].set_aspect('equal')
# Specify colors
fig.set_facecolor('black')
cmap = 'viridis_r'
# Plot distance to nearest POI
hb = ax[0][0].hexbin(
x=nodes['x'],
y=nodes['y'],
gridsize=75,
cmap=cmap,
C=distances[1],
alpha=1
)
# Plot distance to 5th nearest POI
hb = ax[0][1].hexbin(
x=nodes['x'],
y=nodes['y'],
gridsize=75,
cmap=cmap,
C=distances[5],
alpha=1
)
# Plot distance to 10th nearest POI
hb = ax[1][0].hexbin(
x=nodes['x'],
y=nodes['y'],
gridsize=75,
cmap=cmap,
C=distances[10],
alpha=1
)
# Add titles to maps
ax[0][0].set_title('Walking time to nearest POI', fontsize=30, fontweight='bold')
ax[0][1].set_title('Walking time to 5th nearest POI', fontsize=30, fontweight='bold')
ax[1][0].set_title('Walking time to 10th nearest POI', fontsize=30, fontweight='bold')
# Setup color bar axis (location of color bar)
cbar_ax = fig.add_axes([0.5, 0.01, 0.5, 0.5])
cbar_ax.set_axis_off()
# Color bar properties
cb = plt.colorbar(
hb,
ax=cbar_ax,
shrink=0.9,
ticks=[82, 300, 600, 900],
orientation='horizontal',
aspect=20
)
cb.outline.set_edgecolor('none')
cb.ax.tick_params(color='none', labelsize=30)
cb.ax.set_xticklabels(['<= 1', '5', '10', '>= 15'])
# Title for color bar
ax[1][1].text(0.5, 0.23, 'Walking time (minutes)', fontsize=30, ha='center')
# Tight layout
plt.tight_layout()
# Save
plt.savefig('./data/walk_access_comparison.png')
Unsupervised Classification of Landsat 8 Imagery
​
Unsupervised classification is valuable in satellite imagery for its efficiency in data compression and anomaly detection, such as cloud and ship identification. It's also useful in semi-supervised learning, where it quickly categorizes unlabeled data, complementing partially labeled datasets.
​
For my analysis of satellite image classification, I first need to set up my coding environment. I'll start by importing all the necessary libraries. These include numpy for handling arrays, rasterio for input/output operations with remotely sensed data, scikit-learn for unsupervised classification and model evaluation, and matplotlib for data visualization. I also use the os library to manage file paths. My focus was on obtaining satellite imagery of Kisumu. For this, I used OSMnx to download the data and ensured the data was in the correct format and projection (CRS) for accurate analysis.
​
Before diving into the analysis, I started by visualizing the scene to understand its characteristics. For this, I use bands 2, 3, and 4 of the Landsat 8 scene. When applying the k-means algorithm, I incorporate a broader range of data, including bands 1 to 11. Since all these bands have a resolution of 30 meters per pixel, there's no need for upsampling or downsampling in the analysis. The Kisumu area was densely packed with buildings and structures, making it challenging to discern vegetation or subtle details in the predominantly brownish images. To enhance visualization, particularly of man-made objects, I utilized a different band combination: 7-6-4. This selection significantly emphasizes these features. I adjusted the red, green, and blue arguments in the show_rgb function to accommodate this new combination.


View Code


#| echo: true
#| code-fold: true
import os
import numpy as np
import rasterio as rio
from rasterio.mask import mask
from shapely.geometry import Point, mapping
from shapely.geometry.polygon import Polygon
from shapely.geometry import Point
import geopandas as gpd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from skimage.exposure import equalize_adapthist
import matplotlib.pyplot as plt
#| echo: true
#| code-fold: true
#loading data
path = 'data/landsat8'
complete_dataset = os.listdir(path)
complete_dataset = [path + x for x in complete_dataset]
print(complete_dataset)
#| echo: true
#| code-fold: true
path = 'data/landsat8'
bands_list = [os.path.join(path, filename) for filename in os.listdir(path) if filename.endswith('.tif')]
#| echo: true
#| code-fold: true
def show_rgb(bands_list, red=4, green=3, blue=2):
stack = []
for band_number in [red, green, blue]:
# Adjust the pattern to match the filename structure we see in the output
band_file = next((band for band in bands_list if f"T1_B{band_number}.tif" in band), None)
if band_file:
with rio.open(band_file) as src:
array = src.read(1)
# Normalize the array
array = array.astype('float32')
array_min, array_max = array.min(), array.max()
array = (array - array_min) / (array_max - array_min)
stack.append(array)
else:
print(f"No band file found for band number {band_number}")
if stack: # Check if stack is not empty
stack = np.dstack(stack)
# Display the image
plt.figure(figsize=(15, 15))
plt.axis('off')
plt.imshow(stack)
plt.show()
else:
print("No arrays were added to stack, check the band files.")
#| echo: true
#| code-fold: true
# Example of using bands_list
band_number = 4 # Replace with the correct band number you are looking for
band_file = next((band for band in bands_list if f"T1_B{band_number}.tif" in band), None)
#| echo: true
#| code-fold: true
band_file = next((band for band in bands_list if f"T1_B{band_number}.tif" in band), None)
#| echo: false
#| code-fold: true
path = 'data/landsat8' # Adjust the path as needed
bands_list = [os.path.join(path, filename) for filename in os.listdir(path) if filename.endswith('.tif')] # Make sure the extension matches
print(bands_list) # Check the band paths
show_rgb(bands_list, red=4, green=3, blue=2) # For natural color
#| echo: true
#| code-fold: true
show_rgb(bands_list, red=5, green=4, blue=3)
For the models development, I employed the KMeans package to classify pixel values in the satellite images. I experimented with different numbers of clusters to understand which best captures the terrain types in the scene. To determine the number of clusters that best represent the terrain types, I experimented with 3 and 6 clusters, stacking all bands for a comprehensive analysis. After comparing the resulting images with the original RGB scene, I conclude that six categories yield the most optimal results. Three categories fail to capture the spatial variability, with a significant presence of NaN values obscuring useful data.

Cluster 0: 29.80%
Cluster 1: 14.53%
Cluster 2: 4.30%
Cluster 3: 11.55%
Cluster 4: 28.88%
Cluster 5: 10.93%

Cluster 0: 42.46%
Cluster 1: 15.46%
Cluster 2: 42.09%
View Code


#| echo: true
#| code-fold: true
# Stack all bands
def stack_bands(bands_list):
stack = []
for band_path in bands_list:
with rio.open(band_path) as src:
stack.append(src.read(1).flatten()) # Flatten each band
return np.stack(stack, axis=1)
for band_path in bands_list:
with rio.open(band_path) as src:
array = src.read(1)
print(array.shape) # Print out the shape of each array
def stack_bands(bands_list):
stack = []
for band_path in bands_list:
with rio.open(band_path) as src:
array = src.read(1)
# Only append bands that match the desired shape (279, 280) for example
if array.shape == (279, 280):
stack.append(array.flatten()) # Flatten each band
else:
print(f"Excluded {band_path} due to shape mismatch: {array.shape}")
return np.vstack(stack).T # Use vstack and transpose to get the correct shape for KMeans
# Get the stacked data excluding any mismatched bands
data_stack = stack_bands(bands_list)
# Stack all bands
def stack_bands(bands_list):
stack = []
for band_path in bands_list:
with rio.open(band_path) as src:
stack.append(src.read(1).flatten()) # Flatten each band
return np.stack(stack, axis=1)
#| echo: true
#| code-fold: true
# Reshape the data
data_stack = stack_bands(bands_list)
# Standardize the data
scaler = StandardScaler()
data_standardized = scaler.fit_transform(data_stack)
# Determine the optimal number of clusters
# Use methods like the Elbow method or Silhouette score here
# Sample for silhouette score
range_n_clusters = [2, 3, 4, 5, 6]
silhouette_avg = []
for num_clusters in range_n_clusters:
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(data_standardized)
silhouette_avg.append(silhouette_score(data_standardized, cluster_labels))
#| echo: true
#| code-fold: true
# Choose the optimal number of clusters
optimal_clusters = range_n_clusters[silhouette_avg.index(max(silhouette_avg))]
# Apply K-Means Clustering
kmeans6 = KMeans(n_clusters=6, random_state=42)
y_pred = kmeans.fit_predict(data_standardized)
# Reshape classified data back to the original image shape
with rio.open(bands_list[0]) as src:
meta = src.meta
original_shape = (src.height, src.width)
classified_image6 = y_pred.reshape(original_shape)
# Visualize the result
plt.figure(figsize=(10, 10))
plt.imshow(classified_image6, cmap='terrain')
plt.colorbar()
plt.show()
# Assuming y_pred is your flattened array of predictions from KMeans
# and classified_image is your 2D array of cluster labels reshaped into the image's dimensions
# Count the occurrences of each cluster label
unique, counts = np.unique(classified_image6, return_counts=True)
cluster_counts6 = dict(zip(unique, counts))
# Calculate the total number of pixels
total_pixels6 = classified_image6.size
# Calculate the percentage of each cluster
cluster_percentages6 = {k: (v / total_pixels6) * 100 for k, v in cluster_counts6.items()}
# Print the percentages
for cluster, percentage in cluster_percentages6.items():
print(f'Cluster {cluster}: {percentage:.2f}%')
# Apply K-Means Clustering
kmeans3 = KMeans(n_clusters=3, random_state=42)
y_pred = kmeans3.fit_predict(data_standardized)
# Reshape classified data back to the original image shape
with rio.open(bands_list[0]) as src:
meta = src.meta
original_shape = (src.height, src.width)
classified_image3 = y_pred.reshape(original_shape)
# Visualize the result
plt.figure(figsize=(10, 10))
plt.imshow(classified_image3, cmap='terrain')
plt.colorbar()
plt.show()
# Assuming y_pred is your flattened array of predictions from KMeans
# and classified_image is your 2D array of cluster labels reshaped into the image's dimensions
# Count the occurrences of each cluster label
unique, counts = np.unique(classified_image3, return_counts=True)
cluster_counts3 = dict(zip(unique, counts))
# Calculate the total number of pixels
total_pixels3 = classified_image3.size
# Calculate the percentage of each cluster
cluster_percentages3 = {k: (v / total_pixels3) * 100 for k, v in cluster_counts3.items()}
# Print the percentages
for cluster, percentage in cluster_percentages3.items():
print(f'Cluster {cluster}: {percentage:.2f}%')
clustered_models = ClusteredBands(complete_dataset)
clustered_models.set_raster_stack()
Conclusion
​
Initial visual assessment of the walkability map and the unsupervised classification map suggests a potential relationship between urban land use clusters and areas with higher walkability scores in Kisumu. Urban clusters, as identified by the classification algorithm, appear to correspond with regions that have shorter walking times to points of interest, which may indicate better walkability. Conversely, less walkable areas might align with land use clusters that are characteristic of rural or undeveloped regions. To confirm and quantify this relationship, a comprehensive spatial analysis is recommended.