Assessing the performance of AI-assisted mapping of building footprints for OSM

Room: Tsavo Hall

Sunday, 09:30
Duration: 20 minutes (plus Q&A)


Slides pdf files for SoTM talk

Figure 1. Sample of 256x256 tiles at zoom 19 for labels (binary masks) and adjacent RGB images for four of the sixteen urban regions.

Figure 2. Loss (left axis, grey) and four metrics (right axis, coloured) for four of the sixteen training data sets, with 20 epochs and batch size 16; both training and validation values are shown (continuous and dashed lines, respectively).

Figure 3. Example of categorical accuracy plotted against three other metrics, with colours indicating the epoch of the final accuracy value (maximum categorical accuracy).

Figure 4. Heatmap plots of the final values of five validation accuracy metrics for three urban characteristics: urbanity type, density level, roof cover (epoch 20, batch size 8).


video on media.ccc.de
video on YouTube


  • Anna Zanchetta

fAIr is an open AI-assisted mapping service developed by the Humanitarian OpenStreetMap Team (HOT), with the aim of improving and assisting mapping for humanitarian aid and disaster relief. This proposal describes the research undertaken to assess the performance of fAIr's underlying ML model training: the selection of the training datasets, the choice of the metrics used to measure accuracy, and the analysis of the results obtained with different metrics. The research falls within the broader body of work on understanding the fine-tuning process for geographic domain adaptation in image analysis validation, particularly for building footprint detection.


Introduction Building footprint features are useful in a wide range of applications such as disaster assessment, urban planning, and environmental monitoring (You et al., 2018; Owusu et al., 2021; Yang, Matsushita, & Zhang, 2023), and their identification has been gaining increasing attention in ML research for Earth Observation (Hoeser, Bachofer, & Kuenzer, 2020). Particularly in the disaster response context, accurate and prompt availability of such information is crucial (Boccardo & Giulio Tonolo, 2015; Deng, 2022; Sun et al., 2022). fAIr, a fully open AI-assisted mapping service developed by the Humanitarian OpenStreetMap Team (HOT) to generate semi-automated building footprint features, addresses this need (fAIr website, 2022). fAIr stands for “Free and open source AI that is resilient for local contexts, and represents the Responsibility of HOT for local communities and humanitarian mapping”, reflecting the objective of HOT to improve and assist mapping for humanitarian aid and disaster relief. Useful open-source data sets of AI-generated building footprints exist (e.g. Microsoft’s global buildings dataset, available through Rapid, and Google’s Open Buildings for Africa and the Global South at large); however, the Machine Learning (ML) models behind them are not currently open-sourced. fAIr, on the other hand, addresses this lack of openness in AI models by being a fully open source project (HOT Tech Blog, 2022). While building footprint mapping in OSM is currently supported in most countries through Rapid, "users should take care to ensure adjustments and corrections are made as needed" (Rapid - OSM wiki, 2024); fAIr addresses this issue by reintroducing the human in the loop. In its current state, fAIr allows OSM mappers to create their own local training dataset, train/fine-tune a pre-trained Eff-UNet model, and then map into OSM with the assistance of their own local model.
In fAIr's initial release, the performance of the model after training was not assessed; the objective of this research is to address this gap. This proposal describes the research carried out in recent months to assess how the ML fine-tuning process performs, investigating the currently used accuracy metric and comparing it against different sets of evaluation metrics. The final aim of this research is to advise on the optimal metric for this building footprint segmentation task. The research falls within the broader body of work on understanding the fine-tuning process for geographic domain adaptation in image analysis validation (Rainio, Teuho, & Klén, 2024; Maier-Hein et al., 2024).

Data and Methodology fAIr is a software service that performs semantic segmentation to detect building footprints from openly available, high-resolution (centimetre-scale) local satellite and UAV imagery (OpenAerialMap website, 2017). In computer vision (CV), semantic segmentation is the task of segmenting an image into semantically meaningful classes, and it is typically performed with convolutional neural network (CNN) architectures (Hoeser & Kuenzer, 2020). The deep learning CNN model used in fAIr is called RAMP (Replicable AI for MicroPlanning), and its architecture originates from an Eff-UNet model (RAMP model card, 2020; Baheti, Innani, Gajre, & Talbar, 2020). To analyse the current validation accuracy performance and compare it against other metrics, an initial literature review on fine-tuning processes for geographic domain adaptation in image analysis validation was performed, which led to a list of candidate validation metrics (Reinke et al., 2024; Maier-Hein et al., 2024). In parallel, manual labelling of selected areas of interest (AoI) was carried out on a data set of sixteen urban regions, chosen to be as representative as possible of different grades of urbanity, density, regional characteristics, roof cover types, etc. The pre-processing of the AoI images was performed for each urban region through the fAIr-dev website (fAIr website, 2022), and produced 256x256 georeferenced tiles for both the original RGB images and the labelled masks (see example in Figure 1). Then, the ML training was run on all sixteen training datasets using an Nvidia Tesla T4 GPU, for different batch sizes, epochs, and zoom levels.
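To give a sense of scale for the 256x256 tiles at zoom 19 mentioned above, the following is a minimal sketch of the ground resolution of such a tile. It assumes the tiles follow the standard Web Mercator slippy-map scheme, which the zoom levels suggest but the abstract does not state; the function names are hypothetical and are not part of fAIr.

```python
import math

# Equatorial circumference of the WGS84 ellipsoid, as used by Web Mercator.
EARTH_CIRCUMFERENCE_M = 40_075_016.686
TILE_SIZE_PX = 256  # tile size stated in the abstract


def ground_resolution(lat_deg: float, zoom: int, tile_size: int = TILE_SIZE_PX) -> float:
    """Approximate metres per pixel at a given latitude and zoom level.

    Assumes standard Web Mercator tiling (an assumption, not confirmed
    by the source).
    """
    world_px = tile_size * 2 ** zoom  # width of the whole map in pixels
    return EARTH_CIRCUMFERENCE_M * math.cos(math.radians(lat_deg)) / world_px


def tile_span_m(lat_deg: float, zoom: int, tile_size: int = TILE_SIZE_PX) -> float:
    """Approximate side length of one tile on the ground, in metres."""
    return ground_resolution(lat_deg, zoom, tile_size) * tile_size


if __name__ == "__main__":
    for lat in (0.0, 30.0, 60.0):
        print(lat, ground_resolution(lat, 19), tile_span_m(lat, 19))
```

Under this assumption, a zoom-19 tile covers about 0.30 m per pixel and about 76 m per side at the equator, and less at higher latitudes, so a single tile typically contains only a handful of buildings.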

Preliminary results Figure 2 shows an example of the outcomes of the ML training for four of the urban regions, with five types of validation metric (categorical accuracy, precision, recall, IoU (intersection over union), and F1 score) and categorical cross entropy as the loss function. The analysis of the research results is still a work in progress. The performance of IoU, the suggested metric for this type of image analysis problem (Reinke et al., 2024), will be assessed against the currently used metric, categorical accuracy, and against other commonly used metrics. As a preliminary example of future analysis, Figure 3 shows how categorical accuracy scores compare with other validation metrics at their final value, i.e. the value at the epoch at which the checkpoints are saved, which is currently the epoch of maximum validation categorical accuracy (early stopping). The different characteristics of the urban regions will also be assessed against the performance of the current model. From empirical tests, the model is expected to have lower accuracy in denser areas. Figure 4 seems to confirm that, for a specific epoch count (here 20 is shown as an example) and batch size, the performance is higher in urban regions that are sparser and that have a grid layout than in denser regions, for all metrics except recall. The same figure, for roof types, suggests that the model performance is lower for cement roofs than for mixed, metal and shingle covers. In terms of urban "type", the results suggest that the current model performs better in refugee camp regions and worse in peri-urban regions, with higher variability among the validation metrics for semirural and highly urban regions. Further analysis of the statistical significance of these preliminary patterns is under way, with the possibility of extending the research to other urban regions.
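For reference, the five metrics named above can be computed from a pair of binary masks as in the following minimal sketch. It uses the standard pixel-wise definitions for the binary case with NumPy; it is not fAIr's or RAMP's actual implementation, and the function name and the toy tile are hypothetical.

```python
import numpy as np


def segmentation_metrics(y_true, y_pred):
    """Pixel-wise metrics for binary masks (True/1 = building).

    Standard textbook definitions; a ratio with a zero denominator is
    reported as 0.0 by convention here.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    tp = np.sum(y_true & y_pred)    # building predicted as building
    fp = np.sum(~y_true & y_pred)   # background predicted as building
    fn = np.sum(y_true & ~y_pred)   # building predicted as background
    tn = np.sum(~y_true & ~y_pred)  # background predicted as background

    def safe_div(num, den):
        return float(num) / float(den) if den else 0.0

    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "iou": safe_div(tp, tp + fp + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


# Toy 256x256 tile (invented for illustration): 6.25% of pixels are building.
truth = np.zeros((256, 256), dtype=bool)
truth[:64, :64] = True

# A degenerate prediction that labels every pixel as background.
all_background = np.zeros_like(truth)
scores = segmentation_metrics(truth, all_background)
print(scores)
```

In this toy case the all-background prediction reaches an accuracy of 0.9375 while IoU, precision, recall and F1 are all 0, which illustrates why pixel accuracy can look high on tiles dominated by background.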

Conclusions To avoid overfitting to the training dataset, early stopping is in place; it is therefore important to choose the optimal metric with respect to how fAIr performs. The research will assess the statistical significance of the different metric choices, both for model performance and for different urban characteristics. Further research will concentrate on using other factors to help drive the fine-tuning process, such as loss regularisation, and will extend the analysis to the prediction performance.
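The dependence of the saved checkpoint on the monitored metric can be illustrated with a minimal sketch. The history values below are invented purely for illustration and are not results from this research; the dictionary keys and the function name are hypothetical.

```python
def best_epoch(history, monitor):
    """Return the 1-based epoch at which the monitored metric is highest.

    Ties resolve to the earliest epoch.
    """
    values = history[monitor]
    return max(range(len(values)), key=values.__getitem__) + 1


# Invented validation curves over five epochs (illustration only).
history = {
    "val_categorical_accuracy": [0.90, 0.93, 0.95, 0.94, 0.94],
    "val_iou": [0.40, 0.52, 0.58, 0.63, 0.61],
}

epoch_by_accuracy = best_epoch(history, "val_categorical_accuracy")
epoch_by_iou = best_epoch(history, "val_iou")
print(epoch_by_accuracy, epoch_by_iou)
```

With these made-up curves, monitoring categorical accuracy would keep the checkpoint from epoch 3, whereas monitoring IoU would keep the one from epoch 4, so the two choices would save different model weights.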