Address and Transaction Type Prediction
Bitcoin addresses may receive coins from ransomware and darknet market payments. Since 2009, malicious entities have used three money-laundering regimes on the blockchain, with increasing sophistication: peeling chains, coin-mixing and shapeshifting. In this task, the goal is to identify which Bitcoin addresses are owned and maintained for illicit gains. Some dark (known to be illicit) addresses have been published in academic works.
Ransomware Classification Dataset: BitcoinHeist
We provide a dataset with 24,486 addresses from 27 ransomware families. The dataset contains ten features extracted from the Bitcoin transaction network for both dark and ordinary addresses.
Feature Description
1-address: String. Bitcoin address.
2-year: Integer. Year.
3-day: Integer. Day of the year. 1 is the first day, 365 is the last day.
4-length: Integer.
5-weight: Float.
6-count: Integer.
7-looped: Integer.
8-neighbors: Integer.
9-income: Integer. Satoshi amount (1 bitcoin = 100 million satoshis).
10-label: Category String. Name of the ransomware family (e.g., Cryptxxx, cryptolocker, etc.) or white (i.e., not known to be ransomware).
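The column types above can be applied while reading the data. The following is a minimal sketch that assumes the file is a CSV whose header uses the ten feature names listed above; the two sample rows, the placeholder addresses and the helper names (AddressRecord, read_records) are made up for illustration and do not come from the dataset.

```python
import csv
import io
from dataclasses import dataclass

SATOSHIS_PER_BTC = 100_000_000  # 1 bitcoin = 100 million satoshis


@dataclass
class AddressRecord:
    address: str
    year: int
    day: int
    length: int
    weight: float
    count: int
    looped: int
    neighbors: int
    income: int  # amount in satoshis
    label: str

    @property
    def income_btc(self) -> float:
        """Income converted from satoshis to bitcoins."""
        return self.income / SATOSHIS_PER_BTC

    @property
    def is_ransomware(self) -> bool:
        """True for any label other than 'white'."""
        return self.label != "white"


def read_records(handle):
    """Yield one AddressRecord per CSV row, converting each column to its documented type."""
    for row in csv.DictReader(handle):
        yield AddressRecord(
            address=row["address"],
            year=int(row["year"]),
            day=int(row["day"]),
            length=int(row["length"]),
            weight=float(row["weight"]),
            count=int(row["count"]),
            looped=int(row["looped"]),
            neighbors=int(row["neighbors"]),
            # Defensive assumption: tolerate values written as "1e+08".
            income=int(float(row["income"])),
            label=row["label"],
        )


# Synthetic rows for illustration only; these are not real dataset entries.
SAMPLE_CSV = """address,year,day,length,weight,count,looped,neighbors,income,label
addr_a,2016,120,18,0.5,1,0,2,100000000,white
addr_b,2016,121,44,0.25,12,0,1,250000000,cryptolocker
"""

records = list(read_records(io.StringIO(SAMPLE_CSV)))
for record in records:
    print(record.address, record.income_btc, record.is_ransomware)
```

To read a downloaded file instead, pass an open file handle to read_records in place of the in-memory sample.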
Our graph features are designed to quantify specific transaction patterns. Looped is intended to count how many transactions i) split their coins, ii) move these coins through the network along different paths, and iii) finally merge them in a single address. Coins at this final address can then be sold and converted to fiat currency. Weight quantifies the merge behavior (i.e., the transaction has more input addresses than output addresses), where coins in multiple addresses are each passed through a succession of merging transactions and accumulated in a final address. Like weight, the count feature is designed to quantify the merging pattern. However, the count feature represents information on the number of transactions, whereas the weight feature represents information on the amount transferred (what percent of these transactions' output). Length is designed to quantify mixing rounds on Bitcoin, where transactions receive and distribute similar amounts of coins in multiple rounds with newly created addresses to hide the coin origin.
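As a small illustration of the merge pattern described above, the sketch below labels a transaction by comparing its number of input and output addresses. Only the merge rule (more inputs than outputs) is stated in the text; the "split" and "transition" rules are assumed here as the symmetric cases, and the function name and sample transactions are hypothetical. This does not compute the dataset's features, which require the full transaction graph.

```python
def classify_transaction(n_inputs: int, n_outputs: int) -> str:
    """Label a transaction by comparing input and output address counts."""
    if n_inputs > n_outputs:
        return "merge"  # stated in the text: more inputs than outputs
    if n_inputs < n_outputs:
        return "split"  # assumption: the symmetric case
    return "transition"  # assumption: equal counts


# Hypothetical transactions as (input address count, output address count).
transactions = [(3, 1), (1, 4), (2, 2)]
patterns = [classify_transaction(i, o) for i, o in transactions]
print(patterns)
```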
Note that although we are certain about ransomware labels, we do not know if all white addresses are in fact not related to ransomware.
Sample Dataset
The sample data set contains 100 data points.
Full Dataset
Data Set Characteristics: Multivariate, Time-Series
Task 1: Classification (binary or multi-label) - Given the features and labels of white addresses and known ransomware addresses, predict which other addresses are undisclosed ransomware payment addresses (that receive ransom payments). In a similar vein, one can also predict which undisclosed addresses are co-owned by the owners of known ransomware addresses. In the binary classification case, we relabel the addresses from all ransomware families as virus.
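The binary relabeling in Task 1 can be sketched as follows. The function name is hypothetical and the label values are the examples used in this document; the sketch assumes white is spelled exactly as "white".

```python
from collections import Counter


def to_binary_label(label: str) -> str:
    """Collapse every ransomware family name into the single class 'virus'."""
    return "white" if label == "white" else "virus"


# Illustrative labels only.
labels = ["white", "cryptolocker", "Cryptxxx", "white"]
binary = [to_binary_label(label) for label in labels]
counts = Counter(binary)
print(binary, counts)
```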
Task 2: Temporal prediction - Use address data until time t to train a model to predict address labels after t.
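One possible way to set up Task 2 is to split on the (year, day) pair, so that all rows up to and including the cutoff t go to training and all later rows go to testing. This is a sketch under that assumption; the function name, the cutoff value and the rows are hypothetical.

```python
def temporal_split(rows, cutoff_year, cutoff_day):
    """Split rows into (train, test) around the cutoff t = (cutoff_year, cutoff_day)."""
    cutoff = (cutoff_year, cutoff_day)
    # Tuples compare by year first, then by day of the year.
    train = [r for r in rows if (r["year"], r["day"]) <= cutoff]
    test = [r for r in rows if (r["year"], r["day"]) > cutoff]
    return train, test


# Synthetic rows for illustration only.
rows = [
    {"address": "addr_a", "year": 2015, "day": 300, "label": "white"},
    {"address": "addr_b", "year": 2016, "day": 10, "label": "cryptolocker"},
    {"address": "addr_c", "year": 2016, "day": 200, "label": "white"},
]

train_rows, test_rows = temporal_split(rows, cutoff_year=2016, cutoff_day=100)
print(len(train_rows), len(test_rows))
```

Splitting by time rather than at random keeps information from after t out of the training set.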
Task 3: Classification with imbalanced data - Given the features and labels of white addresses and known ransomware addresses, predict which other addresses are undisclosed ransomware payment addresses (that receive ransom payments). Ransomware addresses are quite few compared to the approximately 800K daily Bitcoin addresses.
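A common starting point for imbalanced data is inverse-frequency class weighting, so that errors on the rare class cost more during training. The sketch below uses the formula n_samples / (n_classes * class_count), which matches the "balanced" heuristic documented by scikit-learn. This is a generic technique, not something the dataset authors prescribe, and the label counts are made up.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Illustrative imbalance only: 98 white addresses and 2 ransomware addresses.
labels = ["white"] * 98 + ["virus"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

Many classifiers accept such a mapping as per-class weights; the rare class receives the larger weight.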
Challenge: The dataset is quite clean and has no missing values. It contains address features that we have extracted from daily Bitcoin transaction graphs; however, we sample only 1000 white addresses per day. Bitcoin contains approximately 800K daily addresses, hence the full network is quite large. If more addresses need to be analyzed, you must load the whole transaction graph and work on a graph over more than 10 years (2009-202x) to extract your own features.
Number of instances: 2916697
Number of attributes: 10
Classification target attribute: label
Missing data: none
The data set (110MB compressed) contains 2916697 data points. Note that ordinary (white) Bitcoin addresses are capped (sampled) at 1K per day because Bitcoin has up to 800K addresses daily.
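If you extract your own features and want to apply a similar daily cap, one option is sketched below. The text does not specify how the white addresses were sampled, so uniform random sampling per (year, day) is an assumption here, and the function name and rows are hypothetical.

```python
import random
from collections import defaultdict


def cap_white_per_day(rows, cap=1000, seed=0):
    """Keep all ransomware rows and at most `cap` white rows per (year, day)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    white_by_day = defaultdict(list)
    kept = []
    for row in rows:
        if row["label"] == "white":
            white_by_day[(row["year"], row["day"])].append(row)
        else:
            kept.append(row)  # ransomware rows are never dropped
    for day_rows in white_by_day.values():
        if len(day_rows) > cap:
            day_rows = rng.sample(day_rows, cap)  # assumption: uniform sampling
        kept.extend(day_rows)
    return kept


# Synthetic rows: five white addresses and one ransomware address on the same day.
rows = [{"year": 2016, "day": 1, "label": "white"} for _ in range(5)]
rows.append({"year": 2016, "day": 1, "label": "cryptolocker"})

capped = cap_white_per_day(rows, cap=3)
print(len(capped))
```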
Cite Our Dataset:
@inproceedings{chartalistNeurips2022,
author = {Kiarash Shamsi and Yulia R. Gel and Murat Kantarcioglu and Cuneyt G. Akcora},
title = {Chartalist: Labeled Graph Datasets for UTXO and Account-based Blockchains},
booktitle = {Advances in Neural Information Processing Systems 35: Annual Conference
on Neural Information Processing Systems 2022, NeurIPS 2022, November 29-December
1, 2022, New Orleans, LA, USA},
pages = {1--14},
year = {2022},
url = {https://openreview.net/pdf?id=10iA3OowAV3}
}
Baseline
We approach the baseline detection task by extracting six features from the daily Bitcoin transaction network for each address. We have designed the graph features to quantify ransomware operators’ specific obfuscation patterns. Afterward, we employ tree-based methods, clustering, and naive similarity search on the feature matrix of all Bitcoin addresses. Baseline models yield better recall than precision. Similar to our (proposed) topological data analysis (TDA) method, which performs the best, the DBSCAN clustering algorithm can ignore data points in its model building; two of the best non-TDA results are delivered by DBSCAN models. In the best TDA models for each ransomware family, we predict 16.59 false positives for each true positive. In turn, this number is 27.44 for the best non-TDA models.
@inproceedings{DBLP:conf/ijcai/AkcoraLGK20,
author = {Cuneyt Gurcan Akcora and Yitao Li and Yulia R. Gel and Murat Kantarcioglu},
editor = {Christian Bessiere},
title = {BitcoinHeist: Topological Data Analysis for Ransomware Prediction on the Bitcoin Blockchain},
booktitle = {Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, {IJCAI} 2020},
pages = {4439--4445},
publisher = {ijcai.org},
year = {2020}
}
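The false-positive ratios reported for the baseline can be read as precision values, since precision = TP / (TP + FP) = 1 / (1 + FP/TP). The short sketch below does this arithmetic for the two figures quoted above; the function name is hypothetical, and the result treats each ratio as a single aggregate, which is a simplification.

```python
def precision_from_fp_per_tp(fp_per_tp: float) -> float:
    """Convert 'false positives per true positive' into precision."""
    return 1.0 / (1.0 + fp_per_tp)


tda_precision = precision_from_fp_per_tp(16.59)      # best TDA models
non_tda_precision = precision_from_fp_per_tp(27.44)  # best non-TDA models

print(f"TDA precision:     {tda_precision:.3f}")
print(f"non-TDA precision: {non_tda_precision:.3f}")
```

Under this reading, the best TDA models reach a precision of about 5.7%, compared with about 3.5% for the best non-TDA models.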