UTXO address clustering aims at discovering which blockchain (e.g., Bitcoin, Litecoin, ZCash, Monero) addresses are owned by a blockchain user. Clustering is a widely researched topic in Data Science, however, on blockchains address clustering has more to do with developing heuristics to link addresses than applying a data clustering algorithm. The heuristics are derived from insights into the transaction creation process on blockchains or the address usage behavior of users.
In this task, the goal is to invent a new clustering heuristic for UTXO based blockchains. On Bitcoin, there are currently three well-known heuristics.
Co-spending address heuristic
The co-spending (also known as multi-input) heuristic uses transaction signing behavior in UTXO blockchains and posits that all input addresses of a transaction belong to the same user. For example, in the figure below, addresses a, b, and c belong to the same owner.
Transition address heuristic
This heuristic extends the co-spending heuristic and assumes that if a transaction has inputs from a and b, whereas another transaction has from a and c, b and c belong to the same user.
Addresses a, b, c, d, and e belong to the same user.
Change Address Heuristic
The heuristic posits that the one-time change (output) address— if one exists— is controlled by the same user as the input addresses.
The following four conditions must be met to identify a change address:
- the output address has not appeared in any previous transaction
- the transaction is not a coin generation
- there is no self-change address in the outputs
- all the other output addresses in the transaction have appeared in previous transactions
Task: Clustering - Develop new heuristics to identify which addresses are owned by the same entity.
Challenge:Clustering is prone to create many false positives because blockchain exchanges manage user addresses and use them as inputs to large transactions.
Target attribute: None
Full dataset: See Bitcoin transaction network data files.
Cite Our Dataset:
@inproceedings{chartalistNeurips2022,
author = {Kiarash Shamsi and Yulia R. Gel and Murat Kantarcioglu and Cuneyt G. Akcora},
title = {Chartalist: Labeled Graph Datasets for UTXO and Account-based Blockchains},
booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference
on Neural Information Processing Systems 2022, NeurIPS 2022, November 29-December
1, 2022, New Orleans, LA, USA},
pages = {1--14},
year = {2022},
url = {https://openreview.net/pdf?id=10iA3OowAV3}
}
Baseline (Ransomware Clustering)
We approach the baseline detection task by extracting six features from the daily Bitcoin transaction network for each address. We have designed the graph features to quantify ransomware operators’ specific obfuscation patterns. Afterward, we employ tree bases methods, clustering, and naive similarity search on the feature matrix of all Bitcoin addresses. baseline models yield better recall than precision. Similar to our (proposed) topological data analysis method (TDA), which performs the best, the DBSCAN clustering algorithm can ignore data points in its model building; two of the best non-TDA results are delivered by DBSCAN models. In the best TDA models for each ransomware family, we predict 16.59 false positives for each true positive. In turn, this number is 27.44 for the best non-TDA models.
@inproceedings{DBLP:conf/ijcai/AkcoraLGK20,
author = {Cuneyt Gurcan Akcora and Yitao Li and Yulia R. Gel and Murat Kantarcioglu},
editor = {Christian Bessiere},
title = {BitcoinHeist: Topological Data Analysis for Ransomware Prediction on the Bitcoin Blockchain},
booktitle = {Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, {IJCAI} 2020},
pages = {4439--4445}, publisher = {ijcai.org}, year = {2020}
}