Maya Hall

Flood Risk Analysis

Assessing machine learning methods for flood risk identification in Buncombe County, North Carolina

Overview and Methods: This mini-project used several machine learning algorithms to predict flood risk areas in Buncombe County, NC. With storms and flood events growing more frequent and severe, understanding where flood risk is concentrated is critically important. As a feasibility study, we focused on conditions in Buncombe County around September 2024 due to the recent and tragic flood events there. Based on existing literature, we selected the following predictor variables for our models: Topographic Wetness Index (TWI), distance to streams (DTS), Normalized Difference Vegetation Index (NDVI) from Landsat 9 OLI-2, National Land Cover Database (NLCD) data, and social vulnerability data from the U.S. Census Bureau. For our flood risk layer, we used FEMA's National Flood Hazard Layer (NFHL) 100-year floodplain, extended with a 1 km buffer. For modeling, we applied random forest classification and support vector machine classification, and conducted a principal component analysis. Our mini-project is still ongoing; however, we have produced some preliminary results, which I am happy to share below!
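For readers curious about the workflow, here is a minimal R sketch of how the predictor stack and an initial random forest could be assembled with the terra and randomForest packages. File names, layer names, and parameter values are illustrative placeholders rather than the project's actual inputs, and the rasters are assumed to already share a common grid and projection.

    library(terra)
    library(randomForest)

    # Stack the predictor rasters (placeholder file names), assumed to be
    # pre-aligned to a common extent, resolution, and CRS
    predictors <- rast(c("twi.tif", "dist_streams.tif", "ndvi.tif",
                         "nlcd.tif", "social_vulnerability.tif"))
    names(predictors) <- c("twi", "dts", "ndvi", "nlcd", "svi")

    # Response: rasterized NFHL 100-year floodplain (plus 1 km buffer),
    # where 1 = "yes flood" and 0 = "no flood"
    flood <- rast("nfhl_100yr_buffered.tif")

    # Convert the raster stack to a dataframe of complete cases
    dat <- as.data.frame(c(predictors, flood), na.rm = TRUE)
    names(dat)[ncol(dat)] <- "flood"
    dat$flood <- factor(dat$flood)   # treat the response as a class, not a number
    # NLCD class codes are kept numeric here for simplicity; a factor encoding
    # would be more faithful to their categorical nature

    # Initial random forest classification
    rf <- randomForest(flood ~ ., data = dat, ntree = 500, importance = TRUE)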

Results: The initial random forest model classified "no flood" pixels (coded 0) well but rarely identified "yes flood" pixels (coded 1) correctly. The out-of-bag error for this first iteration was 2.25%, and the confusion matrix revealed that very few "yes flood" pixels were properly classified. Among the predictors, distance to streams was the most important variable.
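These diagnostics come from standard randomForest output; a brief sketch, assuming the fitted rf object from the snippet above:

    print(rf)        # reports the out-of-bag (OOB) error estimate and confusion matrix
    rf$confusion     # rows = observed class, columns = predicted class, plus class error
    importance(rf)   # mean decrease in accuracy / Gini for each predictor
    varImpPlot(rf)   # quick visual ranking of predictor importance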

Upon further investigation, we found that "no flood" pixels greatly outnumbered "yes flood" pixels, so we tested several techniques to address this class imbalance. The first was downsampling: we randomly removed "no flood" (0) rows until the dataframe contained equal numbers of 0 and 1 rows, which significantly improved random forest performance. The second approach assigned class weights during training, weighting "yes flood" (1) pixels more heavily than "no flood" (0) pixels; this did not yield noticeable improvements. We ultimately moved forward with the random downsampling of 0 pixels. The balanced model's predicted flood risk map can be seen below.
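A minimal sketch of the two balancing approaches, assuming the dat dataframe and predictor stack from the first snippet; the seed and weight values are illustrative, not the ones used in the project:

    set.seed(123)  # illustrative seed for reproducibility

    # Method 1: downsample the majority class so 0s and 1s are equally represented
    ones    <- which(dat$flood == 1)
    zeros   <- sample(which(dat$flood == 0), length(ones))
    dat_bal <- dat[c(ones, zeros), ]
    rf_bal  <- randomForest(flood ~ ., data = dat_bal, ntree = 500, importance = TRUE)

    # Method 2: keep all rows but up-weight the minority class
    # (classwt is given in the order of the factor levels: "0", then "1")
    rf_wt <- randomForest(flood ~ ., data = dat, ntree = 500, classwt = c(1, 10))

    # Predict the balanced model back onto the raster stack to map flood risk
    flood_map <- terra::predict(predictors, rf_bal, na.rm = TRUE)
    plot(flood_map)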

Discussion and Conclusion: More results coming soon!

Tools Used: R, QGIS, Google Earth Engine (GEE)

Keywords: Flood Risk, Random Forest, Machine Learning, Remote Sensing, GIS

Project Contributors: Katie Miller (SVM, data acquisition, code refinement), Truman Anarella (PCA, data acquisition, code refinement), and Maya Hall (random forest, data acquisition, code refinement)
