One of the inherent concerns about crowdsourced crisis information is that the data is not statistically representative and hence “useless” for any serious kind of statistical analysis. But my colleague Christina Corbane and her team at the European Commission’s Joint Research Centre (JRC) have come up with some interesting findings that suggest otherwise. They used the reports mapped on the Ushahidi-Haiti platform to show that this crowdsourced data can help predict the spatial distribution of structural damage in Port-au-Prince. The results were presented at this year’s Crisis Mapping Conference (ICCM 2010).
The data on structural damage was obtained using very high resolution aerial imagery. Some 600 experts from 23 different countries joined the World Bank-UNOSAT-JRC team to assess the damage based on this imagery. This massive effort took two months to complete. In contrast, the crowdsourced reports on Ushahidi-Haiti were mapped in near-real time and could “hence represent an invaluable early indicator on the distribution and on the intensity of building damage.”
Corbane and her co-authors “focused on the area of Port-au-Prince (approximately 9 by 9 km) where a total of 1,645 messages have been reported and 161,281 individual buildings have been identified, each classified into one of the 5 different damage grades.” Since the focus of the study is the relationship between crowdsourced reports and the intensity of structural damage, only grades 4 and 5 (structures beyond repair) were taken into account. The result is a bivariate point pattern consisting of two variables: 1,645 crowdsourced reports and 33,800 damaged buildings (grades 4 and 5 combined).
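To make the idea of a bivariate point pattern concrete, here is a minimal Python sketch of how such a pattern could be assembled. The class and function names, the "sms" and "damage" labels, and the sample coordinates are all hypothetical and purely illustrative; the study's actual data format and tooling are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkedPoint:
    """One point in the pattern; the mark says which variable it belongs to."""
    x_km: float
    y_km: float
    mark: str  # "sms" (crowdsourced report) or "damage" (building)


def build_bivariate_pattern(reports, buildings, min_grade=4):
    """Combine report locations and heavily damaged buildings into one pattern.

    reports:   iterable of (x_km, y_km)
    buildings: iterable of (x_km, y_km, damage_grade), with grades 1 to 5
    Only buildings with grade >= min_grade (grades 4 and 5 by default) are kept.
    """
    pattern = [MarkedPoint(x, y, "sms") for x, y in reports]
    pattern += [
        MarkedPoint(x, y, "damage")
        for x, y, grade in buildings
        if grade >= min_grade
    ]
    return pattern


# Made-up coordinates inside a 9 by 9 km window, for illustration only.
sample_reports = [(1.2, 3.4), (5.0, 5.1)]
sample_buildings = [(1.3, 3.5, 5), (4.9, 5.0, 4), (7.7, 0.8, 2)]
sample_pattern = build_bivariate_pattern(sample_reports, sample_buildings)
```

Applied to the real data, the same filter would keep the 33,800 buildings in grades 4 and 5 alongside the 1,645 reports.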
The above graphic simply serves as an illustrative example of the possible relationships between simulated distributions of SMS and damaged buildings. The two figures below represent the actual spatial distribution of crowdsourced reports and damaged buildings according to the data. The figures show that both the crowdsourced reports and the damaged buildings are clustered, although the clustering is more pronounced for the latter. This suggests that some kind of correlation exists between the two distributions.
Corbane and colleagues therefore used spatial point process statistics to better understand and characterize the spatial structures of crowdsourced reports and building damage patterns. They used Ripley’s K-function, which is often considered “the most suitable and functional characteristic for analyzing point processes.” The results demonstrate a statistically significant correlation between the spatial patterns of crowdsourced reports and building damage at “distances ranging between 1 and 3 to 4 km.”
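For readers curious about what a bivariate (cross-type) K-function measures, the following is a rough Python sketch using NumPy. It is a naive estimator with no edge correction, run on simulated points, and is not the authors' implementation; the function name, the point counts and the radii are assumptions chosen for illustration.

```python
import numpy as np


def cross_k(points_a, points_b, radii, area):
    """Naive cross-type Ripley's K estimate (no edge correction).

    K_ab(r) = area / (n_a * n_b) * number of (a, b) pairs within distance r.
    If the two patterns are independent, K_ab(r) is roughly pi * r**2;
    values above that suggest the two types of points attract each other.
    """
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Full matrix of distances between every point in a and every point in b.
    dists = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))
    counts = np.array([(dists <= r).sum() for r in radii])
    return area * counts / (len(a) * len(b))


# Simulated, independent uniform points in a 9 by 9 km window (not real data).
# Sizes are kept small because the distance matrix grows as n_a * n_b.
rng = np.random.default_rng(0)
sim_reports = rng.uniform(0.0, 9.0, size=(200, 2))
sim_damage = rng.uniform(0.0, 9.0, size=(500, 2))
radii_km = np.array([0.5, 1.0, 2.0, 3.0])

k_observed = cross_k(sim_reports, sim_damage, radii_km, area=81.0)
k_independent = np.pi * radii_km ** 2  # benchmark under independence
for r, k_obs, k_ind in zip(radii_km, k_observed, k_independent):
    print(f"r = {r:.1f} km: K = {k_obs:.2f}, independence benchmark = {k_ind:.2f}")
```

In practice, significance is usually judged by comparing the observed curve against envelopes from many simulated patterns, and an edge correction is applied near the window boundary. Which of those choices the study made is not stated here.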
The co-authors then used the marked Gibbs point process model to “derive the conditional intensity of building damage based on the pairwise interactions between SMS [crowdsourced reports] and building damages.” The resulting model was then used to compute predicted damage intensity values, which are depicted below alongside the observed damage intensity.
The figures show a particularly strong similarity between the pattern produced by the predictive model and the actual damage pattern. This visual inspection is confirmed by the computed correlation between the observed and predicted damage patterns shown below.
In sum, the results of this empirical study demonstrate the existence of a spatial dependence between crowdsourced data and damaged buildings. The results of the analysis also show how statistical interactions between the patterns of crowdsourced data and building damage can be used for modeling the intensity of structural damage to buildings.
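One simple way to put a number on how similar two intensity maps are is to bin both onto the same grid and compute a Pearson correlation across cells. The sketch below does that in Python with NumPy. How the study actually computed its correlation is not specified here, so the grid size, the function names and the toy points are all assumptions.

```python
import numpy as np


def intensity_grid(points, extent_km, bins):
    """Points per square km in each cell of a bins x bins grid.

    The grid covers the square window [0, extent_km] x [0, extent_km].
    """
    pts = np.asarray(points, dtype=float)
    counts, _, _ = np.histogram2d(
        pts[:, 0], pts[:, 1],
        bins=bins,
        range=[[0.0, extent_km], [0.0, extent_km]],
    )
    cell_area = (extent_km / bins) ** 2
    return counts / cell_area


def pattern_correlation(observed, predicted):
    """Pearson correlation between two intensity grids of the same shape."""
    return float(np.corrcoef(observed.ravel(), predicted.ravel())[0, 1])


# Toy example: three made-up building locations in a 9 by 9 km window.
toy_points = np.array([[0.5, 0.5], [8.5, 8.5], [8.6, 8.4]])
toy_observed = intensity_grid(toy_points, extent_km=9.0, bins=3)
# A stand-in "prediction" that is a linear rescaling of the observed grid.
toy_predicted = 2.0 * toy_observed + 1.0
toy_r = pattern_correlation(toy_observed, toy_predicted)
```

A real comparison would replace the stand-in prediction with the intensity surface produced by the fitted model.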
These findings are rather stunning. Data collected using unbounded crowdsourcing (non-representative sampling), largely in the form of SMS from the disaster-affected population in Port-au-Prince, can predict, with surprisingly high accuracy and statistical significance, the location and extent of structural damage post-earthquake.
The World Bank-UNOSAT-JRC damage assessment took 600 experts 66 days to complete. The cost probably figured in the hundreds of millions of dollars. In contrast, Mission 4636 and Ushahidi-Haiti were both ad-hoc, volunteer-based projects and virtually all the crowdsourced reports used in the study were collected within 14 days of the earthquake (most within 10 days).
But what does this say about the quality/reliability of crowdsourced data? The authors don’t make this connection, but I find the implications particularly interesting since the actual content of the 1,645 crowdsourced reports was not factored into the analysis, only the GPS coordinates, i.e., the metadata.