Insight: Advances in Hashing for Counterterrorism

29 March 2023 GIFCT

The following is a staff insight from GIFCT’s Director of Technology, Tom Thorley. In this role, Tom leads the organization’s efforts to deliver cross-platform technical solutions for GIFCT members.

As terrorists and violent extremists continue to evolve their tactics to exploit digital platforms, we advance our cross-platform efforts in order to counter them. One key cross-platform solution has been the sharing of hashes. “Hashing” has been used to combat online harms for years and is now used in many tech platforms’ efforts to moderate content related to terrorism and violent extremism, along with child sexual abuse material, non-consensual image sharing, and other harms. 

GIFCT’s hash-sharing database is currently our leading cross-platform technical solution as we advance our mission to prevent terrorists and violent extremists from exploiting digital platforms. This database allows GIFCT member companies to identify if and where terrorists or violent extremists are attempting to exploit their respective platforms by sharing content to promote, recruit, and incite. For this database to remain an effective solution, we must continue to enhance its technology and broaden the range of terrorist and violent extremist content it can address, keeping pace with the dynamic and adversarial online threat landscape. Here is the latest on how GIFCT is approaching this essential work in support of our member companies.

What Is Hashing?

Hashes are numerical representations of original content and cannot be easily reverse-engineered to recreate the image, video, or other source content they represent. These hashes can then be used by companies to see if content on their platform matches the hash, surfacing it to be reviewed against their respective policies and terms of service. 

Several hashing algorithms are used in GIFCT’s hash-sharing database to support members with different processes and methods for surfacing and reviewing content. These algorithms can be grouped into two kinds:

  1. Cryptographic hashes (e.g. MD5)
  2. Locality-sensitive hashes (e.g. PDQ, PhotoDNA)

Cryptographic hashes match only the exact file that was hashed. They have very high precision (this exact file only) but low recall (any change at all to the file means the hashes will be totally different). Typically, these hashes are fast to compute, small to store, and fast to find matches against. 
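To make the exact-match behavior concrete, here is a minimal sketch using MD5 from Python’s standard hashlib module. The byte strings and the known_hashes set are hypothetical stand-ins for real media files and a shared hash list; this is an illustration, not a description of how GIFCT’s database is implemented.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-ins for a media file and a copy with one extra byte.
original = b"example bytes standing in for an image or video file"
altered = original + b"\x00"

# A shared list of known hashes, modelled here as a plain set (an assumption).
known_hashes = {md5_hex(original)}

# A byte-for-byte copy produces the same digest, so it is found in the set.
exact_copy_matches = md5_hex(bytes(original)) in known_hashes  # True

# Any change to the file produces an unrelated digest, so the lookup misses.
altered_copy_matches = md5_hex(altered) in known_hashes  # False
```

The lookup is a simple set membership test, which is why this kind of matching is fast and cheap, and also why a re-encoded or cropped copy slips past it.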

Locality-sensitive hashes, however, produce hashes that have somewhat lower precision (which can be adjusted depending on how they are used) but higher recall (more variations of a specific item of content will match to that hash). Visually similar content creates hashes that are mathematically close to each other. These algorithms are typically more resource-intensive to produce, store, and use than cryptographic hashes.
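“Mathematically close” is often measured as Hamming distance: the number of bit positions in which two hashes differ. The sketch below assumes 256-bit hashes written as 64 hexadecimal characters (the size PDQ produces). The threshold value and the sample hashes are illustrative assumptions, not values taken from GIFCT’s systems.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the bits that differ between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Illustrative threshold (an assumption): a lower value favours precision,
# a higher value favours recall.
MATCH_THRESHOLD = 31


def is_match(hash_a: str, hash_b: str, threshold: int = MATCH_THRESHOLD) -> bool:
    """Treat two hashes as a candidate match if they differ in few enough bits."""
    return hamming_distance(hash_a, hash_b) <= threshold


# Made-up 256-bit hashes (64 hex characters each), not real image hashes.
reference = "f" * 64          # all 256 bits set
near_copy = "f" * 62 + "00"   # last 8 bits cleared: distance 8
unrelated = "0" * 64          # every bit differs: distance 256

near_is_match = is_match(reference, near_copy)   # True
far_is_match = is_match(reference, unrelated)    # False
```

Moving the threshold is the adjustment referred to above: it changes how many variations of an item are surfaced for review and how many unrelated items come with them.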

So a combination of different types of hashes can give us a good range of precision, recall, and efficiency in how we identify terrorist and violent extremist content. These hashes allow GIFCT members to quickly identify visually similar content on their own platform that has been removed by another member, enabling them to review such content to see if it breaches their terms and conditions (without sharing any user data between companies).

GIFCT’s Hash-Sharing Database

Hashes of identified terrorist and violent extremist-produced content that meets GIFCT’s taxonomy are added to GIFCT’s hash-sharing database, a shared, safe, and secure industry database of hashes available to GIFCT members. When GIFCT members review the content identified by hash-matching, they also have the option to give feedback in the database about that hash, telling other members whether they agree or disagree that the hash relates to terrorist activity and indicating its severity.

GIFCT respects that each member has different policies, corporate purposes, and terms and conditions. As a result, there is no one-size-fits-all approach to how companies use hashes to support their platforms or how member companies apply their policies to the material surfaced from matches against hashes in the hash-sharing database, though GIFCT provides assistance through definitional frameworks and expert resources.

Our Latest Technical Advances

We continually evolve the hash-sharing database so that a greater range of different digital platforms can exchange signals of known terrorist and violent extremist content. This ensures that the database provides a high-quality signal to identify where terrorist and violent extremist activity that violates members’ policies and terms of service may be taking place on their platforms.

Over the last year, we have expanded beyond hashes for images and videos to also hash PDFs of attacker manifestos and branded terrorist and violent extremist content. This is an important evolution, recognizing that online content is not limited to images and videos. We continue to see terrorists and violent extremists attempt to spread radicalizing material through URL links to third-party platforms and through PDFs of curated and self-published propaganda. GIFCT’s hash-sharing database now contains approximately 370,000 distinct items relating to approximately 280,000 visually distinct images, 90,000 visually distinct videos, and 200 textually distinct items related to PDFs.

More specific to the technology used, we’ve now advanced beyond locality-sensitive hashes for images (PDQ) and videos (TMK) to develop a process for hashing PDFs based on the Trend Micro Locality Sensitive Hash (TLSH), which produces hashes of the text within PDFs that can be compared against hashes of other text to assess similarity. While this is an important area of progress, hashing is one imperfect solution to the problem of terrorist content online. GIFCT is constantly working to improve it while also investing in areas like AI with faculty.ai, better integration with our systems using Hasher-Matcher-Actioner (HMA) with Meta, and better content moderation triage processes working with Jigsaw and Tech Against Terrorism.
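The idea of comparing text hashes by distance can be sketched with a much simpler stand-in. The block below is not TLSH and does not reflect GIFCT’s PDF pipeline; it is a small SimHash-style fingerprint built only from Python’s standard library, and the sample texts are invented. It assumes the text has already been extracted from the PDF.

```python
import hashlib
import re

BITS = 64  # fingerprint size chosen for this illustration


def _token_hash(token: str) -> int:
    """Map a word to a stable 64-bit integer."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=BITS // 8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str) -> int:
    """SimHash-style fingerprint: each word votes on every bit of the result."""
    votes = [0] * BITS
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        h = _token_hash(token)
        for bit in range(BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, total in enumerate(votes):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def bit_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Invented sample texts: an original, a lightly edited copy, and unrelated text.
original_text = (
    "The quarterly gardening newsletter explains how to prune apple trees in "
    "late winter, when to plant tomatoes, and which compost mixtures help "
    "seedlings grow strong roots."
)
edited_text = original_text.replace("apple", "pear")
unrelated_text = (
    "Train timetables for the coastal line change on Monday, with additional "
    "evening services, revised platform numbers, and a replacement bus "
    "between two northern stations."
)

near = bit_distance(simhash(original_text), simhash(edited_text))
far = bit_distance(simhash(original_text), simhash(unrelated_text))
```

Because most words are shared, the lightly edited copy should land only a few bits away from the original, while unrelated text should differ in roughly half of the bits. A cryptographic hash of the same two documents would give no such signal.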

To improve hashing solutions across harm types, more investment and development are needed to ensure that these systems are as effective, efficient, and supportive of human rights as possible. 

How Can We Measure Our Effectiveness?

To me, effectiveness means many different things, but it ultimately comes down to the value the cross-platform technologies provide to our members, balanced against how much it costs to run these systems.

We can think of value in a number of different ways. The first is the range of terrorist and violent extremist content we can address. This is why we have expanded our taxonomy to include attacker manifestos and branded terrorist content. This is also why we have been expanding our capabilities to better address a broader range of content formats that are used by violent extremists, including PDFs and text. At present, audio content remains a gap that needs to be addressed.

The second is the precision and recall of the technology that we are using. This is why we constantly optimize our system and develop new algorithms or adopt those developed by our members, such as the VPDQ algorithm released last year by Meta. 
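Precision and recall can be made concrete with a small worked example. Precision is the share of surfaced items that were true matches; recall is the share of true matches that were surfaced. The sketch below uses an invented set of labelled comparisons (a hash distance plus a reviewer’s verdict) to show how both numbers move as the match threshold changes. The data, thresholds, and function name are hypothetical.

```python
from typing import List, Tuple

# Each pair is (hash distance, reviewer confirmed it is the same content).
# These values are invented for illustration only.
labelled_pairs: List[Tuple[int, bool]] = [
    (4, True), (12, True), (28, True), (45, True),
    (20, False), (60, False), (90, False), (110, False),
]


def precision_recall_at(pairs: List[Tuple[int, bool]], threshold: int) -> Tuple[float, float]:
    """Compute precision and recall when distances <= threshold count as matches."""
    tp = sum(1 for dist, same in pairs if dist <= threshold and same)
    fp = sum(1 for dist, same in pairs if dist <= threshold and not same)
    fn = sum(1 for dist, same in pairs if dist > threshold and same)
    # By convention here, report 1.0 when the denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


strict = precision_recall_at(labelled_pairs, threshold=15)    # (1.0, 0.5)
balanced = precision_recall_at(labelled_pairs, threshold=31)  # (0.75, 0.75)
loose = precision_recall_at(labelled_pairs, threshold=50)     # (0.8, 1.0)
```

On this toy data, the strict threshold surfaces nothing wrongly but misses half of the true matches, while the loose threshold finds every true match at the cost of sending one unrelated item to review.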

GIFCT’s role is to develop cross-platform solutions that companies can participate in and use regardless of their size. New technological developments may carry costs that big companies can absorb but that put them out of reach for smaller companies. When it comes to the cost of the hash-sharing database, our primary concern is that the system is both used and usable by platforms of all sizes. The database must not require a level of resources to generate and store hashes that makes it inaccessible to certain companies, and it must be reliable in swiftly matching to and surfacing harmful content for review.

This year, GIFCT is investing in making the hash-sharing database easier to integrate with and in making its full range of features easier to use. The usability and accessibility of the hash-sharing database can be further enhanced by ensuring that the technology, best practices, and standards developed are not confined to addressing terrorism and violent extremism. To make these systems the most impactful, we should be aligning approaches across harm types, making it simpler and more accessible for companies to adopt hash-sharing systems offered to address different needs.

When we innovate and enhance the technology of the database, we look to find the right balance to ensure an effective and sustainable system for all our member companies. Finding this balance requires looking at all the solutions available, such as Google’s Vision AI, Amazon’s Rekognition, Microsoft Azure’s PhotoDNA Cloud Service, Videntifier, or Pex’s Attribution Engine, as well as investing in our own research and development. By reviewing options, iterating, and adopting new technologies, we can select the most efficient solutions to help our members, regardless of their size and level of resources, and work to prevent terrorist and violent extremist exploitation of their platforms.

Measuring and understanding the value the hash-sharing database provides also requires the ability to monitor and measure its impact on GIFCT’s mission of preventing further terrorist and violent extremist exploitation of digital platforms and on upholding human rights more broadly. This work started for us with reporting on the number of hashes in the database, but understanding value requires more than that. At the end of 2022, we published our most recent transparency report, which sought to meaningfully enhance transparency around GIFCT’s hash-sharing database by giving statistics on the breakdown of hashes across the different categories in our taxonomy. But there is more to do here, and we will continue to make this data more available and usable as we seek to measure our impact.

Our Commitment

Embedding human rights into policies and practices is a key commitment across all of our efforts. For our hash-sharing database, we made significant progress in this area last year. As our 2022 transparency report shows, we completed the transition of management and oversight of the hash-sharing database to GIFCT’s team, enhanced metrics and insights on the latest composition of content corresponding to hashes in the database, delivered findings from GIFCT’s first sampling and review exercise of hashes to ensure the quality and reliability of the hash-sharing database for members, and published our code of conduct.

This continues to be a work in progress, and further enhancements to our transparency, governance, and oversight of the database are major goals for GIFCT this year. Processes to secure and anonymize data are a core safeguard to ensure the privacy and security of the end users uploading, viewing, and appearing in the content. Indications that some hashing algorithms are vulnerable to reverse engineering using AI systems only underscore the importance of continued improvement in these areas. Balancing privacy and security with the need to report on what is in the database and how it is used (ensuring that the impact on freedom of expression can be assessed) requires careful legal analysis as well as new research methodologies and metrics to be developed. GIFCT continues to invest in these areas as well. 

Ultimately, hashing is one tool in our fight against terrorism and violent extremism; it is the core of GIFCT’s current operational capability. We have been improving the system, and it must continue to evolve with the dynamic threat. As we build on the existing system, understanding its impact and being transparent about its functionality will enable us to ensure that we live up to our foundational principles of respect for human rights and deliver on our mission of preventing terrorists and violent extremists from exploiting digital platforms.