Fighting the data mountain with LangChain: how to deal with the rising disclosure problem

Dr Brad Pearce
Director & Co-Founder, Polygeist

With the vast amount of media that is consumed and generated by each person, each day, the investigation of digital evidence is becoming increasingly challenging for police.

Disclosure happens in all criminal cases. The police, who investigate crimes and gather evidence, have an obligation to disclose to the defence any material they hold that they consider ‘relevant’ to the case. This means that they have a duty to disclose anything which could potentially undermine the prosecution case or assist the case for the defence. This also affects the Crown Prosecution Service (CPS), which decides whether or not cases will go to trial. Indeed, a single message which casts significant doubt on the case can make it collapse if it surfaces later during the trial; for example, this happened in the high-profile case of R v Allan in 2017, which resulted in a joint review of the disclosure process by the CPS and the Metropolitan Police.

The mountain of digital evidence that can exist on a single smartphone is staggering: there can be hundreds of messages between the accused and a victim, and many more between others which could provide crucial evidence. What is more, this crucial evidence might not just be on the device itself; the COPO initiative enables law enforcement officers and prosecutors to obtain stored electronic content directly from service providers (such as social media websites) located or operating outside the UK, to support criminal investigations and prosecutions. This means that officers need to assess the possibility that fragments of evidence exist in those sources, and have a duty to investigate them thoroughly.

This task has steadily moved from one that would be undertaken solely by an investigator to one requiring the collaboration of digital forensics experts, and it is rapidly becoming something which cannot be completed in a timely fashion even by a team of qualified experts. The extreme cost of prosecution, combined with the possibility of a single rogue message slipping through the net, means that as few as 2% of sexual assault claims are prosecuted.

Is there a better way?

By now we have all become aware of ChatGPT, a technology able to quickly generate text and summaries of text, such as emails and books. The volume of machine-generated text is growing at an extraordinary rate. It is now possible to generate hundreds of thousands of coherent messages, across thousands of threads, on dozens of messaging apps, at the flick of a switch, making it almost impossible to triage digital devices by hand if such a ‘message-bomb’ has been deployed.

However, this Large Language Model (LLM) technology has some surprising benefits. It can be used to find messages of a certain sentiment, or find messages that are ‘like’ an exemplar. It can even find messages where the users are speaking in coded language.
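As a rough illustration of finding messages that are ‘like’ an exemplar, the sketch below ranks messages by cosine similarity to a sample message. It is a simplified stand-in: word counts take the place of the embeddings an LLM would produce, so it only matches shared vocabulary and would not catch paraphrase or coded language. The function names and the sample messages are invented for the example.

```python
import math
import re
from collections import Counter


def vectorise(text):
    """Turn text into a bag-of-words vector (a stand-in for an LLM embedding)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(exemplar, messages, top_k=3):
    """Return the top_k (score, message) pairs most similar to the exemplar."""
    target = vectorise(exemplar)
    scored = [(cosine(target, vectorise(m)), m) for m in messages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hypothetical messages, invented for illustration only.
messages = [
    "Are we still on for lunch tomorrow?",
    "Bring the package to the usual place at midnight",
    "Happy birthday! Hope you have a great day",
    "Leave the package at the usual place, same time",
]

exemplar = "drop the package at the usual place"
for score, text in rank_by_similarity(exemplar, messages, top_k=2):
    print(f"{score:.2f}  {text}")
```

In practice the `vectorise` step would be replaced by an embedding model, which is what allows messages with similar meaning but different wording to score highly.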

This technology is not reserved for the digital elite either; it can be used by anyone for free. LangChain is an open-source Python library for building Natural Language Processing (NLP) applications with LLMs. It can not only perform these search tasks, but also converse with an investigator in the manner of the ChatGPT chatbot. This is possible because LangChain can be used to build a searchable database of content, and because it provides memory components that retain previous chat messages and incorporate them into chains.
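To make the idea of a searchable content database plus conversational memory more concrete, here is a minimal sketch in plain Python. It deliberately does not use LangChain's own classes: `MessageStore`, `ChatMemory` and `build_prompt` are hypothetical names, the retrieval is naive keyword overlap rather than an LLM-backed search, and the final prompt is only printed, not sent to a model.

```python
import re
from dataclasses import dataclass, field


def words(text):
    """Lower-case word set used for naive keyword matching."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


@dataclass
class MessageStore:
    """Hypothetical content database: holds messages, retrieves by keyword overlap."""
    messages: list = field(default_factory=list)

    def add(self, text):
        self.messages.append(text)

    def search(self, query, top_k=2):
        q = words(query)
        scored = [(len(q & words(m)), m) for m in self.messages]
        hits = [pair for pair in scored if pair[0] > 0]
        ranked = sorted(hits, key=lambda pair: pair[0], reverse=True)
        return [m for _, m in ranked[:top_k]]


@dataclass
class ChatMemory:
    """Hypothetical memory component: keeps only the most recent turns."""
    max_turns: int = 4
    turns: list = field(default_factory=list)

    def record(self, role, text):
        self.turns.append((role, text))
        self.turns = self.turns[-self.max_turns:]


def build_prompt(question, store, memory):
    """Combine earlier turns and retrieved messages into one prompt string."""
    history = "\n".join(f"{role}: {text}" for role, text in memory.turns)
    evidence = "\n".join(f"- {m}" for m in store.search(question))
    return (
        f"Conversation so far:\n{history}\n\n"
        f"Relevant messages:\n{evidence}\n\n"
        f"Question: {question}"
    )


# Invented example data.
store = MessageStore()
store.add("Meet me at the station at nine")
store.add("Did you get the tickets for Saturday?")
store.add("The station car park is closed, use the back entrance")

memory = ChatMemory()
memory.record("investigator", "Who mentioned travel plans?")
memory.record("assistant", "Two messages refer to tickets and a station.")

question = "What was said about the station?"
print(build_prompt(question, store, memory))
memory.record("investigator", question)
```

The point of the memory buffer is that each new question is answered in the context of what has already been asked, which is what lets an investigator refine a line of enquiry over several turns.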

This is of huge benefit not just to disclosure within one investigation, but across many. For example, in the organised crime context, LangChain can be used for contact prediction, showing who knows whom, and even what sentiment they have towards each other.
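The contact mapping described here can be pictured as building a graph from message records. The example below is an illustration under heavy assumptions rather than LangChain code: it only maps contacts that are already observed (it does not predict unseen ones), and it averages a crude word-list sentiment score per direction where a real system would presumably ask an LLM to score sentiment. The word lists, names and records are all invented.

```python
from collections import defaultdict

# Tiny hypothetical sentiment lexicon; a real system would use a model instead.
POSITIVE = {"thanks", "great", "good", "trust"}
NEGATIVE = {"angry", "late", "owe", "warning"}


def sentiment(text):
    """Crude score: number of positive words minus number of negative words."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def build_contact_graph(records):
    """Map each (sender, recipient) pair to a message count and mean sentiment."""
    totals = defaultdict(lambda: [0, 0])  # [message count, summed sentiment]
    for sender, recipient, text in records:
        edge = totals[(sender, recipient)]
        edge[0] += 1
        edge[1] += sentiment(text)
    return {
        pair: {"count": n, "mean_sentiment": s / n}
        for pair, (n, s) in totals.items()
    }


# Invented records: (sender, recipient, message text).
records = [
    ("A", "B", "Thanks, great work"),
    ("A", "B", "You are late again"),
    ("B", "C", "Final warning, you owe me"),
]

for (sender, recipient), stats in build_contact_graph(records).items():
    print(sender, "->", recipient, stats)
```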

This technology is going to change digital forensics and the investigation of many types of crime. Most serious crime involves communication, and these tools are like having an army of superhuman evidence triagers at your fingertips.

First published: March 2023
