
Bitcoin Transaction Network Files

This dataset contains transaction edge files from block 0 to block 737900. Block times are given in the bitcoin_times.csv file (US Central time). The data is divided into ten splits. Each split is distributed as a zip archive and holds the edge files for the 73600 blocks mined in that split's range; each .csv file covers 100 blocks. The data was parsed with Google's Bitcoin-ETL library (https://github.com/blockchain-etl/bitcoin-etl). Some transaction outputs do not follow the protocol and are labeled as nonstandard by the ETL library. The addresses in such outputs cannot be parsed, so we report them with a nonstandard tag, for example nonstandard98a1069f93253f2dcf24e989c69c53447fa0870c.
Each line in the input edge file is tab-separated with the format:


blockNo\thash of transaction\tnumOfInputs\tfirst input address\tfirst input amount\tsecond input address\tsecond input amount\t(additional inputs, if exist)\r\n

Each line in the output edge file is tab-separated with the format:

blockNo\thash of transaction\tnumOfOutputs\tfirst output address\tfirst output amount\tsecond output address\tsecond output amount\t(additional outputs, if exist)\r\n
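The two line formats share one layout, so a single parser can cover both. The following is a minimal sketch, not the dataset's own tooling: the names `EdgeRecord`, `parse_edge_line`, and `is_nonstandard` are hypothetical, and it assumes amounts are stored as integers (satoshis).

```python
from typing import List, NamedTuple, Tuple

NONSTANDARD_PREFIX = "nonstandard"  # tag described in the dataset notes


class EdgeRecord(NamedTuple):
    block: int                       # block number
    tx_hash: str                     # transaction hash
    declared_count: int              # third column: number of inputs/outputs
    entries: List[Tuple[str, int]]   # (address, amount) pairs


def is_nonstandard(address: str) -> bool:
    """True if the address carries the nonstandard tag."""
    return address.startswith(NONSTANDARD_PREFIX)


def parse_edge_line(line: str) -> EdgeRecord:
    """Parse one tab-separated line of an input or output edge file."""
    fields = line.rstrip("\r\n").split("\t")
    # A trailing tab would leave empty fields at the end; drop them.
    while fields and fields[-1] == "":
        fields.pop()
    block, tx_hash, count = int(fields[0]), fields[1], int(fields[2])
    rest = fields[3:]
    if len(rest) % 2 != 0:
        raise ValueError("address/amount fields must come in pairs")
    # Assumption: amounts are integers (satoshis).
    entries = [(rest[i], int(rest[i + 1])) for i in range(0, len(rest), 2)]
    return EdgeRecord(block, tx_hash, count, entries)
```

The declared count is kept alongside the parsed pairs rather than enforced, so a caller can compare `declared_count` with `len(entries)` and decide how to handle any mismatch.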

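[Figure: example Bitcoin transaction graph with four transactions (rectangles) and 13 addresses (circles)]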

Consider the Bitcoin graph in the figure above, where transactions are shown as rectangles and addresses as circles. There are four transactions and 13 addresses. This graph would be given in two files: blockID_in.csv and blockID_out.csv. Below, we show the content of each file. Note that previous (unknown) transactions are written in t_x* notation; for example, t_x1 and t_x2 are not shown in the network figure because they appeared in earlier blocks.

-- blockID_in.csv
BlockHeightOft_1	HashOft_1	2	a_1	1	a_2	1
BlockHeightOft_2	HashOft_2	3	a_3	2	a_4	1	a_5	1
BlockHeightOft_3	HashOft_3	1	a_7	0.8
BlockHeightOft_4	HashOft_4	2	a_11	0.3	a_8	3.8

-- blockID_out.csv
BlockHeightOft_1	HashOft_1	2	a_6	10^8	a_7	0.8*10^8
BlockHeightOft_2	HashOft_2	1	a_8	3.8*10^8
BlockHeightOft_3	HashOft_3	3	a_9	0.2*10^8	a_10	0.2*10^8	a_11	0.3*10^8
BlockHeightOft_4	HashOft_4	2	a_12	3.7*10^8	a_13	0.3*10^8
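The example can be turned into a small directed graph to check the counts stated above. This is an illustrative sketch only: the dictionaries `inputs` and `outputs` are hand-copied from the two listings, and all amounts are written in satoshis, which is an assumption for the input side (the listing gives input amounts without the 10^8 factor).

```python
BTC = 10**8  # satoshis per bitcoin

# Hand-copied from blockID_in.csv (assumption: amounts converted to satoshis).
inputs = {
    "t_1": [("a_1", 1 * BTC), ("a_2", 1 * BTC)],
    "t_2": [("a_3", 2 * BTC), ("a_4", 1 * BTC), ("a_5", 1 * BTC)],
    "t_3": [("a_7", 80_000_000)],
    "t_4": [("a_11", 30_000_000), ("a_8", 380_000_000)],
}

# Hand-copied from blockID_out.csv.
outputs = {
    "t_1": [("a_6", 1 * BTC), ("a_7", 80_000_000)],
    "t_2": [("a_8", 380_000_000)],
    "t_3": [("a_9", 20_000_000), ("a_10", 20_000_000), ("a_11", 30_000_000)],
    "t_4": [("a_12", 370_000_000), ("a_13", 30_000_000)],
}

# Directed edges: address -> transaction for inputs,
# transaction -> address for outputs.
edges = []
for tx, pairs in inputs.items():
    edges.extend((addr, tx, amount) for addr, amount in pairs)
for tx, pairs in outputs.items():
    edges.extend((tx, addr, amount) for addr, amount in pairs)

# Every node that is not a transaction is an address.
transactions = set(inputs) | set(outputs)
addresses = {n for src, dst, _ in edges for n in (src, dst)} - transactions

# Input total minus output total is the amount left to the miner as a fee.
fees = {
    tx: sum(a for _, a in inputs[tx]) - sum(a for _, a in outputs[tx])
    for tx in transactions
}

print(len(transactions), len(addresses), len(edges))
```

Under these assumptions the graph has four transaction nodes and 13 address nodes, matching the description of the figure, and each transaction's inputs exceed its outputs by a small fee.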

Sample Dataset

Use one of the block splits as sample data.

Full Dataset

Block Range                Size                             Link
Block 0 to 73600           IN: 2.6 MB, OUT: 7.1 MB          Input / Output
Block 73700 to 147300      IN: 119.9 MB, OUT: 154.8 MB      Input / Output
Block 147400 to 221000     IN: 807.7 MB, OUT: 876.1 MB      Input / Output
Block 221100 to 294700     IN: 1.9 GB, OUT: 2.2 GB          Input / Output
Block 294800 to 368400     IN: 4.1 GB, OUT: 4.7 GB          Input / Output
Block 368500 to 442100     IN: 8.9 GB, OUT: 9.9 GB          Input / Output
Block 442200 to 515800     IN: 12.8 GB, OUT: 13.7 GB        Input / Output
Block 515900 to 589500     IN: 12.7 GB, OUT: 13.7 GB        Input / Output
Block 589600 to 663200     IN: 15.2 GB, OUT: 16.2 GB        Input / Output
Block 663300 to 736900     IN: 15.4 GB, OUT: 16.6 GB        Input / Output
Block 737000 to 737900     IN: 167.6 MB, OUT: 187.6 MB      Input / Output
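Because the larger splits run to several gigabytes, it may be convenient to stream lines straight from a downloaded archive instead of extracting it first. The sketch below uses only the Python standard library; the function name `iter_edge_lines` is hypothetical, and it assumes the archive members end in `_in.csv` or `_out.csv`, following the blockID_in.csv / blockID_out.csv naming used in the example.

```python
import io
import zipfile
from typing import Iterator


def iter_edge_lines(zip_path: str, suffix: str = "_in.csv") -> Iterator[str]:
    """Yield edge-file lines from every archive member ending in `suffix`."""
    with zipfile.ZipFile(zip_path) as archive:
        # Names are sorted lexicographically, which is not block order.
        for name in sorted(archive.namelist()):
            if not name.endswith(suffix):
                continue
            with archive.open(name) as raw:
                # newline="" keeps the original \r\n so it can be stripped here.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for line in text:
                    yield line.rstrip("\r\n")
```

A call such as `for line in iter_edge_lines("split.zip", "_out.csv"): ...` (with a placeholder file name) would then visit the output edge lines one at a time, keeping memory use independent of the archive size.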

Cite Our Dataset:

@inproceedings{chartalistNeurips2022,
  author    = {Kiarash Shamsi and Yulia R. Gel and Murat Kantarcioglu and Cuneyt G. Akcora},
  title     = {Chartalist: Labeled Graph Datasets for UTXO and Account-based Blockchains},
  booktitle = {Advances in Neural Information Processing Systems 35: Annual Conference
               on Neural Information Processing Systems 2022, NeurIPS 2022, November 29-December
               1, 2022, New Orleans, LA, USA},
  pages     = {1--14},
  year      = {2022},
  url       = {https://openreview.net/pdf?id=10iA3OowAV3}
  }

Temporal Nature

The temporal aspect is implicit and vital in almost all blockchain tasks. For this reason, all our data is tagged with temporal information.

Price prediction models have found past price to be the most informative attribute. Even in time-agnostic applications, such as network core decomposition, blockchain researchers divide the transaction network into 24-hour snapshots (as the entire network is too big) and study them in isolation. In another example, malicious actors start using a ransomware money laundering pattern at some point in time, and ML models should learn the origin of the pattern and apply it to future cases. In this sense, blockchains are an important temporal data source, and many models, such as time series analysis and anomaly detection, can benefit from the availability of Chartalist data.
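The 24-hour snapshot idea can be sketched as a grouping of block numbers by calendar day. This is a minimal illustration under stated assumptions: the function `daily_snapshots` is hypothetical, it takes a plain mapping from block number to UNIX timestamp (the column layout of bitcoin_times.csv is not described here), and it cuts days at UTC midnight rather than in US Central time.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, List


def daily_snapshots(block_times: Dict[int, int]) -> Dict[str, List[int]]:
    """Group block numbers by the UTC calendar day of their timestamp."""
    snapshots: Dict[str, List[int]] = defaultdict(list)
    for block, ts in sorted(block_times.items()):
        # Assumption: day boundaries are taken in UTC.
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        snapshots[day].append(block)
    return dict(snapshots)


# Synthetic timestamps for illustration only, not real block times.
example = {0: 0, 1: 86_399, 2: 86_400}
print(daily_snapshots(example))
```

Each resulting list names the blocks whose edge lines belong to one snapshot, so the per-day transaction network can be built and studied in isolation.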

In Bitcoin data, we share either the UNIX timestamp or the US Central time of the data.