Lejdi Prifti


Ethereum Smart Contract Vulnerability Detection with RNNs

17. December 2023


In this article, I continue my research and present Ethereum smart contract vulnerability detection with RNNs. In my previous article, I used CNNs to detect 5 types of vulnerabilities in smart contract bytecode. You can read more here. In this article, I use RNNs instead and compare the two. Which of these two neural networks performs better at Ethereum bytecode vulnerability detection? Keep reading to learn more.

What are RNNs?

Recurrent Neural Networks (RNNs) have had a profound impact on the field of machine learning, particularly in the domain of sequential data processing. Unlike CNNs, RNNs have the ability to capture and utilize information from previous time steps, making them well-suited for tasks involving sequences, such as natural language processing, speech recognition, time series analysis, and more.

The key feature of RNNs is their recurrent connections, which allow information to persist over time within the network. This makes them capable of learning patterns and dependencies in sequential data.

However, early RNN architectures struggled to learn long-term dependencies because of the vanishing and exploding gradient problems: as gradients are propagated back through many time steps, they tend to shrink toward zero or grow without bound. This limitation hindered their ability to capture information from distant time steps.
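To make the vanishing gradient problem concrete, here is a minimal sketch using a scalar RNN, h_t = tanh(w * h_{t-1} + u * x_t). The weights, inputs and the function name are made up for illustration and are not part of the model built later. The gradient of the last state with respect to the first one is a product of one factor per time step, so it shrinks quickly when each factor is below 1.

```python
import math

def gradient_through_time(w, inputs, u=1.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + u * x_t)."""
    h = 0.0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + u * x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

inputs = [0.1] * 50  # hypothetical input sequence
short_range = gradient_through_time(w=0.5, inputs=inputs[:5])
long_range = gradient_through_time(w=0.5, inputs=inputs)
print(f"gradient after 5 steps:  {short_range:.3e}")
print(f"gradient after 50 steps: {long_range:.3e}")
```

With a recurrent weight of 0.5, every factor is at most 0.5, so after 50 steps the gradient is practically zero. Gated architectures such as LSTMs and GRUs were designed to reduce this effect.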

Which dataset is used?

The dataset used to train, validate and test the model is the same as for the CNN model. To make the comparison fair, both models must use the same data. The dataset was created by Martina Rossini and retrieved from the HuggingFace Hub. It is split into a training set used to train the model, a validation set used to validate it and a test set used to test it.

Preparing the dataset

To prepare the dataset, we must go through the same steps as in the other article.

First of all, we load the big-multilabel configuration of the dataset from the HuggingFace Hub using the datasets package. If you don’t have the datasets package installed, you can install it by simply running pip install datasets.

					from datasets import load_dataset

big_multilabel_dataset = load_dataset(path="mwritescode/slither-audited-smart-contracts", name="big-multilabel")

Great! Now that we have the dataset, it’s time to clean it. In ML, cleaning the dataset is often the most important part of the job, because the quality of the data determines the quality of the model.

Exploring the dataset, I noticed that some entries do not have the bytecode available. They include only 0x and nothing else. We must remove these entries rather than feed them to our model.

Furthermore, label 4 only marks a contract as safe. A contract without any vulnerability is considered safe by definition, so we drop this label and shift the labels above it down by one.

					def clean_data(dataset):
  cleaned_data = []
  for data in dataset:
    # Skip entries whose bytecode is empty (only "0x")
    if len(data['bytecode']) > 4:
      new_slither_output = []
      for output in data['slither']:
        # Drop label 4 ("safe") and shift the labels above it down by one
        if output > 4:
          new_slither_output.append(output - 1)
        elif output < 4:
          # Assumed: labels below 4 are kept unchanged
          new_slither_output.append(output)
      # Assumed: keep the entry together with its remapped labels
      cleaned_data.append({'bytecode': data['bytecode'], 'slither': new_slither_output})
  return cleaned_data

cleaned_training_data = clean_data(big_multilabel_dataset["train"])
cleaned_validation_data = clean_data(big_multilabel_dataset["validation"])
cleaned_test_data = clean_data(big_multilabel_dataset["test"])

len(cleaned_training_data), len(cleaned_validation_data), len(cleaned_test_data)

We split the bytecode into strings of size 2, because each byte (and therefore each Ethereum opcode) is written as two hexadecimal characters, and create the training, validation and test features.

					def split_text_into_chars(text, length):
  return " ".join([text[i:i+length] for i in range(0, len(text), length)])
train_bytecode = [split_text_into_chars(data['bytecode'],2) for data in cleaned_training_data]
test_bytecode = [split_text_into_chars(data['bytecode'],2) for data in cleaned_test_data]
val_bytecode = [split_text_into_chars(data['bytecode'],2) for data in cleaned_validation_data]
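As a quick illustration of the splitting step, the sketch below applies the same function to a shortened, made-up bytecode string. Note that the 0x prefix ends up as a token of its own.

```python
def split_text_into_chars(text, length):
    """Split text into space-separated chunks of the given length."""
    return " ".join(text[i:i + length] for i in range(0, len(text), length))

sample_bytecode = "0x6080604052"  # hypothetical, shortened bytecode
tokens = split_text_into_chars(sample_bytecode, 2)
print(tokens)  # 0x 60 80 60 40 52
```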

Next, we must create the labels. Firstly, we must transform the slither data into binary (multi-hot) arrays in which each label is used directly as an index. If vulnerability 4 is present, the element at index 4 (the fifth element, since arrays are indexed from 0) must have a value of 1.

Secondly, we must create dictionaries out of our binary labels. 

					# Convert labels to binary vectors
import numpy as np

# Assumed: the label lists are taken from the cleaned entries
training_slither = [data['slither'] for data in cleaned_training_data]
validation_slither = [data['slither'] for data in cleaned_validation_data]
test_slither = [data['slither'] for data in cleaned_test_data]

def labels_to_binary(y, num_labels):
  """
  Converts the labels into binary format depending on the total number of labels,
  for example: y = [[1, 4]], num_labels = 5, y_binary = [[0, 1, 0, 0, 1]]
  """
  y_binary = np.zeros((len(y), num_labels), dtype=float)
  for i, label_indices in enumerate(y):
      y_binary[i, label_indices] = 1
  return y_binary

num_classes = len(np.unique(np.concatenate(training_slither)))

train_labels_binary = labels_to_binary(training_slither, num_classes)
valid_labels_binary = labels_to_binary(validation_slither, num_classes)
test_labels_binary = labels_to_binary(test_slither, num_classes)

def transform_labels_to_dict(labels_binary):
  labels_dict = {}
  for index in range(num_classes):
    labels_dict[f'{index}'] = []

  for labels in labels_binary:
    for index, label in enumerate(labels):
      # Assumed loop body: collect the value of each class under its own key
      labels_dict[f'{index}'].append(label)
  return labels_dict

validation_dict = transform_labels_to_dict(valid_labels_binary)
train_dict = transform_labels_to_dict(train_labels_binary)
test_dict = transform_labels_to_dict(test_labels_binary)
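The two label steps can be followed on a toy example. The sketch below is a pure-Python variant written only for illustration (the label lists are made up, and the functions take num_labels explicitly instead of relying on a global). It shows how label lists become multi-hot rows, and how the rows are regrouped into one target list per class, which is the shape a model with one output per class expects.

```python
def labels_to_binary(label_lists, num_labels):
    """Turn lists of label indices into multi-hot rows."""
    return [[1.0 if index in labels else 0.0 for index in range(num_labels)]
            for labels in label_lists]

def transform_labels_to_dict(rows, num_labels):
    """Regroup multi-hot rows into one list of targets per class."""
    return {str(index): [row[index] for row in rows] for index in range(num_labels)}

toy_labels = [[1, 4], [], [0]]  # hypothetical label lists for three contracts
rows = labels_to_binary(toy_labels, num_labels=5)
targets = transform_labels_to_dict(rows, num_labels=5)
print(rows[0])       # [0.0, 1.0, 0.0, 0.0, 1.0]
print(targets["4"])  # [1.0, 0.0, 0.0]
```

A contract with an empty label list, such as the second one here, gets a row of zeros: it has none of the vulnerabilities.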

It’s time to create the datasets using Dataset from TensorFlow. Datasets are a combination of features and labels split into batches and prefetched in an optimized way.

					import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((train_bytecode, train_dict)).batch(32).prefetch(tf.data.AUTOTUNE)
validation_dataset = tf.data.Dataset.from_tensor_slices((val_bytecode, validation_dict)).batch(32).prefetch(tf.data.AUTOTUNE)
test_dataset = tf.data.Dataset.from_tensor_slices((test_bytecode, test_dict)).batch(32).prefetch(tf.data.AUTOTUNE)

Create the TextVectorization layer

The primary purpose of a text vectorizer layer is to transform text data into a numerical representation that can be fed into a neural network for training and prediction. This transformation is very important because neural networks operate on numerical data, and converting text into vectors enables the model to learn patterns, relationships, and semantic meanings inherent in the textual data.

					text_vectorizer = tf.keras.layers.TextVectorization(


bytecode_vocab = text_vectorizer.get_vocabulary()
print(f"Number of different characters in vocab: {len(bytecode_vocab)}")
print(f"5 most common characters: {bytecode_vocab[:5]}")
print(f"5 least common characters: {bytecode_vocab[-5:]}")
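The arguments passed to TextVectorization are not shown above, so the following does not try to reproduce them. It is a simplified pure-Python illustration of what such a layer does conceptually: build a vocabulary ordered by frequency, map each token to an integer, and pad or truncate to a fixed length. The function names and the toy corpus are hypothetical. Reserving index 0 for padding and index 1 for unknown tokens mirrors the Keras convention in integer output mode.

```python
from collections import Counter

def build_vocabulary(texts):
    """Order tokens by frequency; index 0 is padding and index 1 is unknown."""
    counts = Counter(token for text in texts for token in text.split())
    return ["", "[UNK]"] + [token for token, _ in counts.most_common()]

def vectorize(text, vocabulary, sequence_length):
    """Map tokens to integer ids, then pad or truncate to a fixed length."""
    index_of = {token: index for index, token in enumerate(vocabulary)}
    ids = [index_of.get(token, 1) for token in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["60 80 60 40 52", "60 60 40"]  # hypothetical tokenized bytecode
vocabulary = build_vocabulary(corpus)
print(vocabulary[:4])                        # ['', '[UNK]', '60', '40']
print(vectorize("60 ff 40", vocabulary, 5))  # [2, 1, 3, 0, 0]
```

The token ff never appears in the toy corpus, so it maps to the unknown index 1, and the two trailing zeros are padding.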

Create the Embedding layer

Its primary purpose is to convert discrete tokenized representations of words or categorical variables into vectors in a continuous space, known as embeddings.

					embedding_layer = tf.keras.layers.Embedding(
    mask_zero=True, # Conv layers do not support masking but RNNs do
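Conceptually, an embedding layer is a lookup table with one vector per token id. The sketch below illustrates that idea in pure Python, together with the mask that mask_zero=True produces for padding positions. The table size, vector size and function names are assumptions made for the example; in the real layer the vectors are learned during training rather than left at their random initial values.

```python
import random

def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """Create a table with one randomly initialised vector per token id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    """Look up one vector per token id and flag padding (id 0) as masked."""
    vectors = [table[token_id] for token_id in token_ids]
    mask = [token_id != 0 for token_id in token_ids]
    return vectors, mask

table = make_embedding_table(vocab_size=6, embedding_dim=4)
vectors, mask = embed([2, 1, 3, 0, 0], table)
print(len(vectors), len(vectors[0]))  # 5 4
print(mask)                           # [True, True, True, False, False]
```

A recurrent layer that receives this mask skips the padded time steps, so the padding does not influence the final state.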

Create the model

In ML, a good approach is to start small and gradually build more complex models. After a couple of experiments and some research, I noticed that GRUs performed better than LSTMs on this task. Additionally, stacking recurrent layers worked better than using a single big layer. The number of recurrent weights in an RNN layer grows quadratically with the number of units, so stacking 2 or 3 smaller layers instead of one big layer keeps the parameter count lower and tends to achieve better results. By stacking two GRU layers, I expect the model to retain information along the sequence and learn to detect certain patterns.
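The quadratic growth can be checked with a little arithmetic. The sketch below counts the weights of a GRU layer, assuming the layout Keras uses by default (three gates, with two bias vectors per gate when reset_after=True). The embedding size of 64 and the single 128-unit layer are hypothetical values chosen only for the comparison.

```python
def gru_parameter_count(input_dim, units, reset_after=True):
    """Count the weights of one GRU layer (3 gates: update, reset, candidate)."""
    kernel = input_dim * units   # input-to-hidden weights per gate
    recurrent = units * units    # hidden-to-hidden weights per gate
    bias = (2 if reset_after else 1) * units
    return 3 * (kernel + recurrent + bias)

embedding_dim = 64  # hypothetical embedding size, not taken from the article
one_big_layer = gru_parameter_count(embedding_dim, 128)
two_stacked = gru_parameter_count(embedding_dim, 64) + gru_parameter_count(64, 32)
print(one_big_layer)  # 74496
print(two_stacked)    # 34368
```

Under these assumptions, a 64-unit layer followed by a 32-unit layer uses fewer than half the weights of a single 128-unit layer.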

					from tensorflow.keras import layers

# Create input layer
inputs = layers.Input(shape=(1,), dtype=tf.string)

# Create vectorization layer
x = text_vectorizer(inputs)

# Create embedding layer
x = embedding_layer(x)

# Create the stacked GRU layers
x = layers.GRU(units = 64, activation='tanh', return_sequences=True)(x)
x = layers.GRU(units = 32, activation='tanh')(x)
x = layers.Dropout(rate=0.2)(x)
x = layers.Dense(32, activation='relu')(x)

# Create one sigmoid output per class
outputs = []
for index in range(num_classes):
  output = layers.Dense(1, activation="sigmoid", name=f'{index}')(x)
  outputs.append(output)

model_1 = tf.keras.Model(inputs = inputs, outputs = outputs, name="model_1")

Compile the model

Since each output node is a binary classifier for a different class, binary cross-entropy is the chosen loss function for every output, and accuracy is the key performance indicator.

					# One loss and one metric list per named output
losses = {}
metrics = {}
for index in range(num_classes):
  losses[f'{index}'] = "binary_crossentropy"
  metrics[f'{index}'] = ['accuracy']
model_1.compile(loss=losses, optimizer=tf.keras.optimizers.Adam(learning_rate=1e-03), metrics=metrics)
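To see what binary cross-entropy rewards and penalizes, here is a small worked example in plain Python. It is a sketch of the formula, not the Keras implementation, and the targets and probabilities are made up.

```python
import math

def binary_cross_entropy(y_true, y_prob, epsilon=1e-7):
    """Mean binary cross-entropy over a list of targets and probabilities."""
    total = 0.0
    for target, prob in zip(y_true, y_prob):
        prob = min(max(prob, epsilon), 1.0 - epsilon)  # avoid log(0)
        total += -(target * math.log(prob) + (1 - target) * math.log(1 - prob))
    return total / len(y_true)

confident_and_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_and_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident_and_right, 4))  # 0.1054
print(round(confident_and_wrong, 4))  # 2.3026
```

Confident correct predictions give a loss close to zero, while confident wrong predictions are penalized heavily. With one such loss per output, each class is trained as its own yes/no decision.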

Fit the model

The model is fitted on the training dataset and validated on the validation dataset. Training uses two callbacks. The first callback is ReduceLROnPlateau, which monitors the validation loss and, if it has not improved for 5 epochs, reduces the learning rate by a factor of 0.1. The second callback is ModelCheckpoint, which also monitors the validation loss and saves the model with the lowest validation loss. The predictions are then performed with the best model saved by the checkpoint.

					history_1 = model_1.fit(train_dataset,
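The interplay of the two callbacks can be illustrated with a simplified simulation in plain Python. This is not the Keras implementation (which also supports options such as a minimum improvement and a cooldown period). It only reproduces the behaviour described above: the factor of 0.1 and the patience of 5 epochs come from the text, while the validation losses and the function name are invented for the example.

```python
def simulate_reduce_lr_on_plateau(val_losses, initial_lr=1e-3, factor=0.1, patience=5):
    """Return the learning rate in effect after each epoch."""
    learning_rate = initial_lr
    best_loss = float("inf")
    epochs_without_improvement = 0
    schedule = []
    for loss in val_losses:
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                learning_rate *= factor
                epochs_without_improvement = 0
        schedule.append(learning_rate)
    return schedule

# Hypothetical validation losses: two improvements, then a plateau.
val_losses = [0.50, 0.40, 0.41, 0.42, 0.40, 0.43, 0.41, 0.39]
schedule = simulate_reduce_lr_on_plateau(val_losses)
# A checkpoint that keeps only the best model would keep this epoch (0-based).
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
print(schedule)
print(best_epoch)  # 7
```

In this toy run the loss stops improving after the second epoch, the learning rate drops after five epochs without improvement, and the next epoch reaches a new best loss, which is the model a checkpoint would keep.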


During training, I noticed that the model started overfitting after a certain point in training. Among the steps I took to overcome overfitting are:

  • Reducing the number of units in the GRU layers
  • Increasing the size of the dataset
  • Reducing the learning rate


I used the test dataset to make predictions. It is a good basis for evaluation because the model has never seen it.

					model_1 = tf.keras.models.load_model(filepath="model_experiments/model_1")
model_1_preds_probs = model_1.predict(test_dataset)

model_1_preds = convert_preds_probs_to_preds(model_1_preds_probs)
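The helper convert_preds_probs_to_preds is not shown in the article. A minimal sketch of what it could look like follows, assuming the probabilities arrive as a dictionary keyed by output name with a flat list of values per class, and assuming a decision threshold of 0.5. Both the input format and the threshold are assumptions; with real Keras output, the arrays would first need to be flattened.

```python
def convert_preds_probs_to_preds(preds_probs, threshold=0.5):
    """Turn per-class probabilities into 0/1 predictions."""
    return {name: [1 if prob >= threshold else 0 for prob in probs]
            for name, probs in preds_probs.items()}

# Hypothetical probabilities for two output heads and three contracts.
toy_probs = {"0": [0.91, 0.08, 0.55], "1": [0.20, 0.49, 0.50]}
toy_preds = convert_preds_probs_to_preds(toy_probs)
print(toy_preds)  # {'0': [1, 0, 1], '1': [0, 0, 1]}
```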


Now that we have the predictions, we want to compute some metrics from them. Are they correct? What percentage of them is correct?

					def combine_results(y_true, y_pred):
  # Evaluate every output separately, keyed by its class index
  results = {}
  for index in range(num_classes):
    results[f'{index}'] = calculate_results(y_true=y_true[f'{index}'], y_pred=y_pred[f'{index}'])
  return results
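The function calculate_results is not shown in the article either. One plausible version, sketched here as an assumption, computes accuracy, precision, recall and F1 for a single binary output. The choice of metrics and the toy labels are illustrative only.

```python
def calculate_results(y_true, y_pred):
    """Accuracy, precision, recall and F1 for one binary output."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

toy = calculate_results(y_true=[1, 0, 1, 1], y_pred=[1, 0, 0, 1])
print(toy["accuracy"], toy["precision"])  # 0.75 1.0
```

Reporting precision and recall next to accuracy is useful here, because most contracts do not have a given vulnerability and accuracy alone can look good for a model that rarely predicts it.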

Make it pretty! Let’s display the results in a table using pandas.

					import pandas as pd
results = combine_results(y_true=test_dict, y_pred=model_1_preds)


I was able to get better results with RNNs than those obtained with CNNs, using a straightforward and minimally tuned model. This suggests that RNNs are a promising approach for detecting vulnerabilities in Ethereum bytecode.
