Video continues to be a powerful way to capture human rights abuses around the world. Videos posted to social media can be used to hold perpetrators of gross violations accountable. But video footage poses a “Big Data” challenge to human rights organizations. Two billion smartphone users means almost as many video cameras. This leads to massive amounts of visual content of both suffering and wrong-doing during conflict zones. Reviewing these videos manually is a very labor intensive, time consuming, expensive and often traumatic task. So my colleague Jay Aronson at CMU has been exploring how artificial intelligence and in particular machine learning might solve this challenge.
As Jay and team rightly note in a recent publication (PDF), “the dissemination of conflict and human rights related video has vastly outpaced the ability of researchers to keep up with it – particularly when immediate political action or rapid humanitarian response is required.” The consequences of this are similar to what I’ve observed in humanitarian aid: At some point (which will vary from organization to organization), time and resource limitations will necessitate an end to the collection, archiving, and analysis of user generated content unless the process can be automated.” In sum, information overload can “prevent human rights researchers from uncovering widely dispersed events taking place over long periods of time or large geographic areas that amount to systematic human rights violations.”
To take on this Big Data challenge, Jay and team have developed a new machine learning-based audio processing system that “enables both synchronization of multiple audio-rich videos of the same event, and discovery of specific sounds (such as wind, screaming, gunshots, airplane noise, music, and explosions) at the frame level within a video.” The system basically “creates a unique “soundprint” for each video in a collection, synchronizes videos that are recorded at the same time and location based on the pattern of these signatures, and also enables these signatures to be used to locate specific sounds precisely within a video. The use of this tool for synchronization ultimately provides a multi-perspectival view of a specific event, enabling more efficient event reconstruction and analysis by investigators.”
Synchronizing image features is far more complex than synchronizing sound. “When an object is occluded, poorly illuminated, or not visually distinct from the background, it cannot always be detected by computer vision systems. Further, while computer vision can provide investigators with confirmation that a particular video was shot from a particular location based on the similarity of the background physical environment, it is less adept at synchronizing multiple videos over time because it cannot recognize that a video might be capturing the same event from different angles or distances. In both cases, audio sensors function better so long as the relevant videos include reasonably good audio.”
Ukrainian human rights practitioners working with families of protestors killed during the 2013-2014 Euromaidan Protests recently approached Jay and company to analyze videos from those events. They wanted to “ locate every video available in their collection of the moments before, during, and just after a specific set of killings. They wanted to extract information from these videos, including visual depictions of these killings, whether the protesters in question were an immediate and direct threat to the security forces, plus any other information that could be used to corroborate or refute other forms of evidence or testimony available for their cases.”
Their plan had originally been to manually synchronize more than 65 hours of video footage from 520 videos taken during the morning of February 20, 2014. But after working full-time over several months, they were only able to stitch together about 4 hours of the total video using visual and audio cues in the recording.” So Jay and team used their system to make sense of the footage. They were able to automatically synchronize over 4 hours of the footage. The figure above shows an example of video clips synchronized by the system.
Users can also “select a segment within the video containing the event they are interested in (for example, a series of explosions in a plaza), and search in other videos for a similar segment that shows similar looking buildings or persons, or that contains a similar sounding noise. A user may for example select a shooting scene with a significant series of gunshots, and may search for segments with a similar sounding series of gunshots. This method increases the chances for finding video scenes of an event displaying different angles of the scene or parallel events.”
Jay and team are quick to emphasize that their system “does not eliminate human involvement in the process because machine learning systems provide probabilistic, not certain, results.” To be sure, “the synchronization of several videos is noisy and will likely include mistakes—this is precisely why human involvement in the process is crucial.”
I’ve been following Jay’s applied research for many years now and continue to be a fan of his approach given the overlap with my own work in the use of machine learning to make sense of the Big Data generated during major natural disasters. I wholeheartedly agree with Jay when he reflected during a recent call that the use of advanced techniques alone is not the answer. Effective cross-disciplinary collaboration between computer scientists and human rights (or humanitarian) practitioners is really hard but absolutely essential. This explains why I wrote this practical handbook on how to create effective collaboration and successful projects between computer scientists and humanitarian organizations.