Can you imagine a world with no police or law enforcement? What would it mean? More crime. More greed. More exploitation. Less safety.

The dark web needs no introduction. Hidden deep beneath the surface internet, it is a place where crime has no limits and users have no face. Until now. DarkBERT could be the savior that limits crime on the dark web.

Before delving deep into what DarkBERT is, here are some stats that give a glimpse of the scale of illegal activity on the dark web:

  • What we see today constitutes only 4% of the total internet. The rest belongs to the deep and dark web.
  • Around 60% of the data on the dark web (60% of 75,000TB) could hurt big companies if leaked.
  • Almost 27.48 million credentials present on the dark web belong to employees of Fortune 1000 companies.
  • Around 56.8% of the content on the dark web is illegal.
  • There are more than 8 million users on the 10 most active dark web hacking forums, and the numbers have only been increasing since the pandemic. This is thanks to the lockdown and, most importantly, the complex layers of the network that mask the IP addresses of users.

So what does all this mean? In one word: trouble!

"Ultimate excellence lies not in winning every battle, but in defeating the enemy without ever fighting," said Sun Tzu in the Art of War.

On the surface web, we have tools such as SIEM solutions integrated with UEBA and SOAR capabilities to help us achieve that excellence. But what happens on the dark web, where anonymous users exchange sensitive data in coded language and sell stolen PII, forged data from big companies, malware, botnets, and exploit kits?

Detecting and tracking these activities seemed close to impossible until recently, when researchers from South Korea came together to build a language model exclusively for the dark web: DarkBERT.

To understand DarkBERT, we first need to understand its predecessor. So, let us rewind a bit and dig a little deeper.

Natural language processing

Natural language processing (NLP) is the branch of computer science that humanizes computers, i.e., it helps computer programs understand and interpret human language, including its syntax, semantics, and lexicon. We come across NLP programs on a daily basis: autocorrect, voice assistants, and chatbots. So, how does NLP understand, interpret, and communicate in human language?

For this, it uses the help of language models. Many language model architectures exist today: n-grams, feedforward neural networks, and recurrent neural networks, to name a few. The transformer is yet another such architecture, and many language models have been built on it. Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT) are two popular families of transformer-based models.

The BERT family of models

BERT is a pre-trained language model based on the transformer architecture. Pre-trained language models are models that are first trained on huge amounts of data. BERT was introduced by Google in 2018 and soon became a revolutionary language model. This is because, unlike its predecessors, it could interpret data bidirectionally, i.e., it had a better understanding of sentences and the context in which they were used. BERT uses the masked language model (MLM) technique to set itself apart from other language models. MLM masks random words in a sentence and, based on the context provided by the surrounding words, BERT completes the sentence by predicting the right word for each mask.
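
To see MLM in action, here is a minimal sketch using the fill-mask pipeline from the Hugging Face transformers library (assuming transformers and a backend such as PyTorch are installed); BERT predicts the hidden word from the surrounding context:

```python
# Minimal fill-mask example: BERT predicts the token hidden behind [MASK].
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Top predictions for the masked word, ranked by probability.
for prediction in fill_mask("Stolen credit card data is often sold on the dark [MASK]."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```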

Robustly Optimized BERT Approach

Based on the BERT framework, researchers from Facebook created a model called Robustly Optimized BERT Approach (RoBERTa). This model performed better than BERT because it was trained on a data set almost 10 times larger. Moreover, RoBERTa was equipped with a better MLM technique: instead of masking the training text just once, it applied a fresh random mask each time a sequence was seen (dynamic masking). RoBERTa also dropped BERT's next sentence prediction objective, which its researchers found did not improve performance.
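
To make the dynamic-masking idea concrete, here is a toy sketch (illustrative only, not the actual training code) that draws a fresh random mask over the same sentence on every pass:

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", rate=0.15):
    """Randomly hide roughly 15% of the tokens, the way MLM training corrupts its input."""
    return [mask_token if random.random() < rate else t for t in tokens]

sentence = "threat actors trade stolen credentials on hidden forums".split()

# Dynamic masking: a different random mask is drawn each time the
# sentence is seen, instead of fixing one mask up front.
for epoch in range(3):
    print(f"epoch {epoch}:", " ".join(mask_tokens(sentence)))
```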

Though both these models did extremely well on the surface web, they could never decode the language of the dark web, as they were never trained for that purpose. While conversations on the surface web are in everyday human language, the dark web relies on coded jargon and domain-specific slang to exchange anonymous messages. RoBERTa served as the base architecture for the development of DarkBERT.

So, what is DarkBERT?

DarkBERT is a pre-trained language model that has been trained on 2.2TB of data collected from multiple websites on Tor. The websites from which the data was collected contained information that was sensitive and could be harmful if exposed. To avoid any exposure, the data used for training was filtered, balanced categorically (an equal amount of data from the different categories was selected), de-duplicated (repetitive information was removed), and pre-processed (the data was cleaned to avoid incomplete or sensitive information from being used for training) by the researchers.
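
As a rough illustration of that kind of cleanup (a hypothetical sketch, not the researchers' actual pipeline), de-duplication and categorical balancing over crawled pages could look like this:

```python
from collections import defaultdict
import hashlib
import random

def clean_corpus(pages, per_category=1000, seed=0):
    """Hypothetical cleanup sketch: drop exact duplicates, then keep an
    equal number of pages per category. Each page is assumed to be a
    dict with "category" and "text" keys."""
    seen_hashes, by_category = set(), defaultdict(list)
    for page in pages:
        digest = hashlib.sha256(page["text"].encode("utf-8")).hexdigest()
        if digest in seen_hashes:          # de-duplication
            continue
        seen_hashes.add(digest)
        by_category[page["category"]].append(page)

    rng = random.Random(seed)
    balanced = []
    for items in by_category.values():     # categorical balancing
        rng.shuffle(items)
        balanced.extend(items[:per_category])
    return balanced
```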

What can DarkBERT do?

DarkBERT can automate the process of detecting threats on the dark web.

Its current capabilities include:

  1. Classifying dark web pages to decide which pages analysts should focus on.

  2. Identifying a threat from communication between members of groups, i.e., it can detect a possible attack on an organization from the messages and data exchanged on various forums.

  3. Identifying a data breach discussion thread with the help of various keywords.

  4. Detecting threat-related keywords. DarkBERT uses the fill-mask function, an MLM technique in which certain words in a sentence are masked and the model predicts them from context, to surface the right threat-related terms (a short sketch follows this list).
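
For illustration, a threat-keyword lookup with fill mask might look like the sketch below. The checkpoint name is an assumption (the released DarkBERT weights are access-restricted), and any RoBERTa-style checkpoint can be swapped in; note that RoBERTa-family models use <mask> rather than BERT's [MASK]:

```python
# Hedged sketch of fill-mask keyword inference on dark-web-style text.
from transformers import pipeline

# Assumed model ID; substitute whichever DarkBERT (or RoBERTa-style)
# checkpoint you actually have access to.
fill_mask = pipeline("fill-mask", model="s2w-ai/DarkBERT")

for prediction in fill_mask("The forum post offers a new <mask> exploit for sale."):
    print(prediction["token_str"], round(prediction["score"], 3))
```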

What can DarkBERT mean for organizations?

If a tool like DarkBERT had existed a few years back, the WannaCry ransomware, which caused a staggering loss of $4 billion, might have been just another threat picked up by DarkBERT while the attack was being plotted in the deep corners of the web.

With data increasingly being stored on digital devices and cloud platforms, the need for security is greater than ever before. The introduction of DarkBERT may push organizations to incorporate dark web audits into their due diligence.

In the future, security analytics and SIEM vendors could begin to offer solutions that leverage DarkBERT. Here are some possible benefits of a SIEM solution that does:

  • The solution could identify the existence of sensitive company data on the dark web. This could be done by feeding keywords to a language model such as DarkBERT.
  • An alert could be raised in the cybersecurity tool every time DarkBERT detects a threat, data breach, ransomware activity, or corporate espionage on a hacking forum (a sketch of this pattern follows the list).
  • Red teams could use DarkBERT to collect information associated with the sale of malware or network vulnerabilities, and to conduct penetration testing.
  • Monthly reports could be sent from DarkBERT to the SIEM solution to analyze and detect threats proactively.
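
Purely as a hypothetical sketch of that alerting pattern, a script could score dark web posts with a fine-tuned classifier and forward high-confidence hits to a SIEM webhook; the model ID, label name, and endpoint below are placeholders, not real products:

```python
# Hypothetical integration sketch: classify dark web posts and push
# high-confidence threat hits to a SIEM's HTTP alert endpoint.
import requests
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="path/to/darkweb-threat-classifier",  # placeholder model ID
)
SIEM_WEBHOOK = "https://siem.example.com/api/alerts"  # placeholder endpoint

def forward_alerts(posts, threshold=0.9):
    for post in posts:
        result = classifier(post, truncation=True)[0]  # {"label": ..., "score": ...}
        if result["label"] == "THREAT" and result["score"] >= threshold:
            requests.post(SIEM_WEBHOOK, json={"text": post, "score": result["score"]})
```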

It is hard to predict what the future holds for cybersecurity tools in an industry as dynamic as information technology. But if there is one thing that remains constant for cybersecurity tools, it is that they need to stay ahead of their enemy at all times. With the introduction of an AI like DarkBERT, the future is full of possibilities, and what comes out of it will only be for the better.

With DarkBERT, the chances of you winning without fighting increase by leaps and bounds.
