
Analyzing Ethereum with Cryo

Posted on: September 23, 2023 at 10:05 PM

Cryo is a blockchain data extraction tool built by Paradigm, and the easiest way to pull chain data into a form ready for analysis. In this article, I install Cryo, download some Ethereum data, and analyze it with polars.


Introduction to ❄️🧊 Cryo 🧊❄️

From the Cryo README:

❄️🧊 cryo 🧊❄️

cryo is the easiest way to extract blockchain data to parquet, csv, json, or a python dataframe.

cryo is also extremely flexible, with many different options to control how data is extracted + filtered + formatted

cryo is an early WIP, please report bugs + feedback to the issue tracker

Storm Slivkoff gave an amazing guide to Cryo in his Rust x Ethereum Day talk:

The YouTube video is only about 20 minutes long. Would recommend.

Cryo is written in Rust, so you will need Cargo installed before you can build it.

For the second part of this article, we will explore the data we have collected with Cryo. We will use Python libraries for this, so you will need Python installed.

Installing Cryo

The first step to use Cryo is to install it.

I chose to build from source because the Cryo 0.2.0 release currently has some build issues when installed with cargo install.

To install:

git clone https://github.com/paradigmxyz/cryo
cd cryo
cargo install --path ./crates/cli

To test your installation, run:

cryo -V

Downloading Data

Cryo can download many different datasets, including blocks, transactions, logs, and traces.

In this article, we’ll download and analyze the blocks dataset:

cryo blocks <OTHER OPTIONS>

Data Source

Cryo needs an RPC URL to extract blockchain data from. Chainlist is an RPC aggregator that collects the fastest free and open endpoints. Any of the HTTP endpoints listed should work, but I chose https://eth.llamarpc.com/ from LlamaNodes.

Our cryo command now looks like this:

cryo blocks --rpc https://eth.llamarpc.com <OTHER OPTIONS>

If you get an error like the following:

send error, try using a rate limit with --requests-per-second or limiting max concurrency with --max-concurrent-requests

you might try a different RPC.

A Note on RPCs

If you are using an online RPC, you will likely have worse performance than if you were running a local node like reth.

if you set up your own reth node you can get 10x that speed — Storm Slivkoff on Telegram

Data Directory

To keep the downloaded data separate from the rest of the analysis files, I created a .data directory. You must create this directory before running cryo.

Adding our data directory to the command:

cryo blocks --rpc https://eth.llamarpc.com -o ./.data/ <OTHER OPTIONS>

Additional Columns

The default blocks schema includes the following columns:

schema for blocks
─────────────────
- number: uint32
- hash: binary
- timestamp: uint32
- author: binary
- gas_used: uint32
- extra_data: binary
- base_fee_per_gas: uint64

but there are also other available fields:

other available columns: logs_bloom, transactions_root, size, state_root, parent_hash, receipts_root, total_difficulty, chain_id

Find this information for your dataset by running:

cryo <DATASET> --dry --rpc https://eth.llamarpc.com

For this analysis, I’m interested in the size column. We can use the -i flag in our command to tell cryo that we want size data:

cryo blocks --rpc https://eth.llamarpc.com -o ./.data/ -i size <OTHER OPTIONS>

Let’s do it!

Before running, we should specify which blocks we are interested in so we avoid downloading the entire blocks dataset (it’s massive). Cryo downloads data in chunks of 1,000 blocks by default, so we’ll use the --align flag to “align block chunk boundaries to regular intervals.”

Our final command looks like this:

cryo blocks -b 18100000:18190000 -i size --rpc https://eth.llamarpc.com --align -o ./.data/

which downloads 90,000 blocks from the RPC and stores them in parquet format.

Running on my laptop took just 1 minute and 38 seconds!! (Still want to try running on a reth node though.)
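As a quick sanity check (a sketch; by default cryo writes one parquet file per chunk of blocks), the data directory should now hold 90 files:

import glob

# each 1,000-block chunk becomes one parquet file, so 90,000 blocks -> 90 files
files = sorted(glob.glob('./.data/*.parquet'))
print(len(files))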

Polars Analysis

The next step in this process is to analyze the data we have. We’ll use the polars DataFrame library to read the parquet files that we have downloaded.

I’ll use an interactive Python notebook (also known as a Jupyter Notebook) inside my VSCode development environment.

Paradigm provides an example notebook on their data website that I’ve used as a template for our analysis.

You can find the full notebook on GitHub; in this section I will discuss some of my findings.
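Each snippet below leans on a scan_df helper defined in the notebook’s setup cell. Roughly, it looks like this (a sketch, assuming the parquet files live in ./.data/):

import numpy as np
import polars as pl
import matplotlib.pyplot as plt

# lazily scan every parquet file that cryo wrote to the data directory
def scan_df() -> pl.LazyFrame:
    return pl.scan_parquet('./.data/*.parquet')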

Timestamps

In this section, I explore the timestamp data of blocks we downloaded:

# get all timestamps as a numpy array
timestamps = scan_df().select(pl.col('timestamp')).collect(streaming=True).to_numpy()

# calculate the time difference between consecutive blocks
time_diff = np.diff(timestamps, axis=0)

print("Average Block Time: ", np.mean(time_diff))
print("Standard Deviation of Block Time: ", np.std(time_diff))
Average Block Time:  12.136534850387227
Standard Deviation of Block Time:  1.2814744196308057
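That average lines up with Ethereum’s 12-second slot time. A quick histogram (a sketch, reusing time_diff from above) makes the slot cadence visible:

# distribution of time between consecutive blocks
plt.figure(figsize=(20, 10))
plt.hist(time_diff, bins=range(0, 40))
plt.title('Time Between Blocks')
plt.xlabel('Seconds')
plt.ylabel('Count')
plt.show()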

Extra data aka Block Graffiti

Extra data is

An optional free, but max. 32-byte long space to conserve smart things for ethernity. :) — https://ethereum.stackexchange.com/a/2377

Many block builders use extra data to identify that they built the block.

# get total gas used by unique extra_data
result_df = scan_df().groupby('extra_data').agg(pl.col('gas_used').sum().alias('tot_gas_used')).collect(streaming=True)
sorted_result_df = result_df.sort('tot_gas_used', descending=True).head(10)
extra_data = sorted_result_df['extra_data'].to_numpy()
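The raw extra_data values are bytes, so to actually read the graffiti they need to be decoded into text. A small sketch (builders don’t always write valid UTF-8, hence errors='replace'):

# decode each builder's extra_data bytes into readable graffiti
for raw, gas in zip(extra_data, sorted_result_df['tot_gas_used']):
    label = bytes(raw).decode('utf-8', errors='replace')
    print(f"{label!r}: {gas} gas")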

Gas

Next, I explored the base_fee of blocks over time. Gas prices, as defined in EIP-1559, include both the base_fee defined per block and a priority fee that is determined for every transaction by its sender. In this section, we analyze the base_fee to learn about gas changes over time.

# get base_fee_per_gas and timestamp, sort by timestamp
res = scan_df().select('base_fee_per_gas', 'timestamp').collect(streaming=True).sort('timestamp').to_numpy()
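Plotted over time (a sketch of the kind of chart in the notebook; base_fee_per_gas is denominated in wei, so dividing by 1e9 converts it to gwei):

# plot base fee (converted to gwei) against block timestamp
plt.figure(figsize=(20, 10))
plt.plot(res[:, 1], res[:, 0] / 1e9)
plt.title('Base Fee Per Gas')
plt.xlabel('Timestamp')
plt.ylabel('Base Fee (gwei)')
plt.show()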

Another interesting gas-related data point we have is the gas_used in each block. Let’s plot a histogram of the gas_used by each block (it turns out to look like a bell curve):

# get gas_used
res = scan_df().select('gas_used').collect(streaming=True).to_numpy()

# histogram of gas_used
plt.figure(figsize=(20, 10))
plt.hist(res, bins=100)
plt.title('Gas Used')
plt.xlabel('Gas Used')
plt.ylabel('Count')
plt.show()

Beautiful.

Block Size

As mentioned previously, we also downloaded the size of each block (in bytes).

The previous distribution was pretty; let’s try that again with block size:

# get size
res = scan_df().select('size').collect(streaming=True).to_numpy()

# histogram of block size
plt.figure(figsize=(20, 10))
plt.hist(res, bins=150)
plt.title('Block Size')
plt.xlabel('Block Size (bytes)')
plt.ylabel('Count')
plt.show()

Less satisfying, but interesting all the same. Let’s try plotting over time:
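Roughly what that looks like in code (a sketch; it re-selects size alongside timestamp, since the query above only pulled size):

# block size over time
by_time = scan_df().select('timestamp', 'size').collect(streaming=True).sort('timestamp').to_numpy()

plt.figure(figsize=(20, 10))
plt.scatter(by_time[:, 0], by_time[:, 1], s=1)
plt.title('Block Size Over Time')
plt.xlabel('Timestamp')
plt.ylabel('Block Size (bytes)')
plt.show()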

I still can’t tell much from this graph; maybe a box-and-whiskers plot will be more informative?
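Something along these lines (a sketch, reusing the res array of block sizes from above):

# box-and-whisker plot of block size
plt.figure(figsize=(10, 10))
plt.boxplot(res[:, 0])
plt.title('Block Size')
plt.ylabel('Block Size (bytes)')
plt.show()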

Huh. Lots of outliers. Maybe we just need numbers?

# Print some summary statistics
print("Min: ", np.min(res[:, 0]))
print("Average: ", np.mean(res[:, 0]))
print("Median: ", np.median(res[:, 0]))
print("Std Dev: ", np.std(res[:, 0]))
print("Max: ", np.max(res[:, 0]))
Min:  1115                   # 1.115KB 
Average:  172033.70095555554 # 0.172MB
Median:  150470.0            # 0.15MB 
Std Dev:  125779.71706563157 # 0.126MB
Max:  2218857                # 2.2MB

Interesting.

Conclusion

We only explored surface-level data here. I really enjoy this kind of messing around with data. Running the entire notebook takes only a few seconds.

Having such easy access to analysis of complex data increases the likelihood that people will explore their data and uncover insights.

If you found this post helpful, please consider subscribing to my newsletter for future updates.