🧱🔗Making Sense of Ethereum Data for Analytics

On the Mark Data
5 min read · Feb 13, 2022
Photo by Nenad Novaković on Unsplash

If you work in tech, you have probably heard about blockchain, crypto, and now web3. Some bemoan the existence of web3, while others describe it as a panacea that will solve the world’s problems through decentralization. Regardless of where you sit on the spectrum of web3 acceptance, if you are a data scientist, I highly encourage you to check out blockchain data for your next project.

In this intro article, I will cover the following:

  1. What is the blockchain, and why should data scientists care?
  2. Making sense of Ethereum transaction logs.
  3. Ethereum blockchain data sources you can start using today.
  4. Links to some of my favorite blockchain analyses.

What is the blockchain, and why should data scientists care?

At a very high level, blockchains are cryptographically secured ledgers of transactions that are append-only: information can be added, but existing records can’t be changed (i.e., they are immutable). What separates blockchains from the ledgers you find at banks is a concept called “decentralization,” where every computer connected to a given blockchain must “agree” on the current state of the blockchain and on any data subsequently added to it. For the sake of brevity, this is a huge simplification of blockchains, but if you want to learn more, I highly recommend checking out the talk by Stanford professor Dan Boneh on the technology.

What makes blockchains a goldmine for data scientists is that decentralization requires the data recorded on-chain to be public and accessible. In other words, as a data scientist, you have access to a global ledger of transactions and metadata. Furthermore, decentralized projects generally need to be open-source to be trusted, so the code is also available to show how the data is created and stored. I know of few real-world datasets that are this large and open to the public.

Making sense of Ethereum transaction logs.

There are multiple blockchains such as Bitcoin, Ethereum, and Polygon — but I suggest you first focus on Ethereum for the following reasons:

  1. Bitcoin primarily records financial transactions, while Ethereum supports both financial transactions and “smart contracts,” which are programs stored on the blockchain that run the code powering many decentralized apps within web3.
  2. There are alternatives to Ethereum for smart contracts, like Polygon, but Ethereum has a huge lead in market share at the time of writing.
  3. Much of the infrastructure to easily consume blockchain data has already been built on Ethereum.

One of those critical pieces of infrastructure is the website Etherscan.io, where one can view any transaction on the Ethereum blockchain. Etherscan is also a great starting point to learn about the data you will soon analyze!

Below is a screenshot directly from Etherscan highlighting the recent transactions for Doodles, a popular NFT project:

Etherscan Logs for the Doodles NFT Project

Much more metadata is available, but the foundational data seen in the transactions above are enough for your first analysis.

Let’s start by reviewing a specific transaction for the Doodles project, which we can look up by searching for the following Txn Hash on Etherscan:

Txn Hash: 0x0d8f48605b4169ec557d4fb25733c41daeeae7dee01b2ee9d54095ffc5cd0164

Below I will provide a brief description of each key and its respective value for the highlighted Txn Hash.

  • Txn Hash: Every transaction, regardless of the completion status, can be identified via its unique identifier known as the “transaction hash.”
    - 0x0d8f48605b4169ec557d4fb25733c41daeeae7dee01b2ee9d54095ffc5cd0164
  • Status: The state of a transaction, which can be “Success,” “Failed,” or “Pending.”
    - Success
  • Block: Transactions are grouped to form a “block” of data to add to the blockchain, where this value represents the block number to identify the respective block.
    - 14194406
  • Timestamp: The time, in UTC, at which the respective block was “mined.”
    - Feb-12-2022 11:58:34 PM +UTC
  • From: The unique id representing the wallet that initiated the blockchain transaction (note: a user may have multiple wallets).
    - 0xc6b65187f9e07b9e4b73a2262dc3bede7301398a
  • To: The unique id representing the address receiving the transaction, which can be a wallet or a smart contract.
    - 0x7be8076f4ea4a4ad08075c2508e481d6c946d12b
  • Value: The amount of Ether transferred for the respective transaction.
    - 17.249 Ether
  • Txn Fee: The Ether cost for processing the respective transaction, regardless of success or failure, where the total cost of a transaction is Value + Txn Fee if successful or just Txn Fee if failed.
    - 0.020267334668888444 Ether
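
The field descriptions above map naturally onto a small record type. The following is a minimal Python sketch, not anything Etherscan provides: the `Transaction` class, its field names, and the `parse_etherscan_timestamp` helper are hypothetical names chosen for illustration, and the values are copied from the example transaction. It uses `Decimal` so that Ether amounts with 18 decimal places are not rounded the way floats would be.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    """One transaction as shown on Etherscan (field names are illustrative)."""
    txn_hash: str
    status: str          # "Success", "Failed", or "Pending"
    block: int
    timestamp: datetime  # time the block was mined, in UTC
    from_address: str
    to_address: str
    value_eth: Decimal   # Ether transferred
    fee_eth: Decimal     # Ether paid to process the transaction

    def total_cost_eth(self) -> Decimal:
        """Value + Txn Fee if successful, or just Txn Fee if failed."""
        if self.status == "Success":
            return self.value_eth + self.fee_eth
        if self.status == "Failed":
            return self.fee_eth
        # A pending transaction has no final cost yet.
        raise ValueError(f"cost is not final for status {self.status!r}")


def parse_etherscan_timestamp(raw: str) -> datetime:
    """Parse a string like 'Feb-12-2022 11:58:34 PM +UTC' into an aware datetime."""
    # Normalise a stray en dash (U+2013) to a hyphen and drop the '+UTC' suffix.
    cleaned = raw.replace("\u2013", "-").replace("+UTC", "").strip()
    parsed = datetime.strptime(cleaned, "%b-%d-%Y %I:%M:%S %p")
    return parsed.replace(tzinfo=timezone.utc)


doodles_txn = Transaction(
    txn_hash="0x0d8f48605b4169ec557d4fb25733c41daeeae7dee01b2ee9d54095ffc5cd0164",
    status="Success",
    block=14194406,
    timestamp=parse_etherscan_timestamp("Feb-12-2022 11:58:34 PM +UTC"),
    from_address="0xc6b65187f9e07b9e4b73a2262dc3bede7301398a",
    to_address="0x7be8076f4ea4a4ad08075c2508e481d6c946d12b",
    value_eth=Decimal("17.249"),
    fee_eth=Decimal("0.020267334668888444"),
)

print(doodles_txn.total_cost_eth())  # 17.269267334668888444
```

Because the transaction succeeded, the sender’s total cost is the 17.249 Ether transferred plus the fee.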

Etherscan provides excellent documentation if you want to learn more about the life cycle of an Ethereum blockchain transaction.

Ethereum blockchain data sources you can start using today.

You can sign up for a free version of the Etherscan API to start pulling Ethereum blockchain data, but there are easier ways to start if you know a little SQL.
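
For readers who do want to try the API route, here is a rough Python sketch of what a request for a single transaction might look like. Treat the details as assumptions: the base URL and the `module`, `action`, `txhash`, and `apikey` parameters reflect my understanding of Etherscan’s “proxy” endpoints, which mirror Ethereum’s JSON-RPC method `eth_getTransactionByHash`, and Etherscan has versioned its API over time, so check the current documentation before relying on them. The helper names are hypothetical. The sketch only builds the URL and decodes an amount; it does not call the network.

```python
from decimal import Decimal
from urllib.parse import urlencode

# Assumption: base URL and parameter names; verify against Etherscan's current docs.
ETHERSCAN_API_URL = "https://api.etherscan.io/api"
WEI_PER_ETHER = Decimal(10) ** 18  # 1 Ether = 10**18 wei


def build_transaction_url(txn_hash: str, api_key: str) -> str:
    """Build the request URL for looking up one transaction by its hash."""
    params = {
        "module": "proxy",
        "action": "eth_getTransactionByHash",
        "txhash": txn_hash,
        "apikey": api_key,
    }
    return f"{ETHERSCAN_API_URL}?{urlencode(params)}"


def hex_wei_to_ether(hex_wei: str) -> Decimal:
    """Convert a hex-encoded wei amount (as JSON-RPC returns it) to Ether."""
    return Decimal(int(hex_wei, 16)) / WEI_PER_ETHER


url = build_transaction_url(
    "0x0d8f48605b4169ec557d4fb25733c41daeeae7dee01b2ee9d54095ffc5cd0164",
    api_key="YOUR_API_KEY",  # placeholder, not a real key
)
print(url)
print(hex_wei_to_ether("0xde0b6b3a7640000"))  # 10**18 wei, i.e. 1 Ether
```

The conversion step matters because JSON-RPC style responses report amounts in wei as hexadecimal strings rather than in Ether, so the raw value needs decoding before it matches what the Etherscan website displays.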

Flipside Crypto

Flipside Crypto pulls data directly from multiple APIs to create numerous blockchain datasets housed within a Snowflake data warehouse. Once you sign up for free, you have access to Velocity, a SQL query editor where you can query the data and save the results to a CSV for further analysis.
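
Once a query result is saved as a CSV, a first pass at “further analysis” can be done with nothing more than Python’s standard library. The sketch below is only an illustration: the column names `tx_hash`, `block_timestamp`, and `eth_value` are hypothetical and will depend on the query you write, and the in-memory sample stands in for a real exported file.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def daily_totals(csv_file) -> dict:
    """Count transactions and sum transferred Ether per calendar day."""
    totals = defaultdict(lambda: {"txn_count": 0, "eth_value": Decimal(0)})
    for row in csv.DictReader(csv_file):
        day = row["block_timestamp"][:10]  # keep the YYYY-MM-DD prefix
        totals[day]["txn_count"] += 1
        totals[day]["eth_value"] += Decimal(row["eth_value"])
    return dict(totals)


# Stand-in for an exported file; the single row is the Doodles transaction
# described earlier, written with the hypothetical column names above.
sample_export = io.StringIO(
    "tx_hash,block_timestamp,eth_value\n"
    "0x0d8f48605b4169ec557d4fb25733c41daeeae7dee01b2ee9d54095ffc5cd0164,"
    "2022-02-12 23:58:34,17.249\n"
)
print(daily_totals(sample_export))
```

For a real export, replace the `io.StringIO` object with a file opened via `open("your_export.csv", newline="")`.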

In addition, web3 companies partner with Flipside Crypto to offer “bounties” where you can submit analytics work to answer questions in return for crypto. It’s a great way to start learning about the types of questions important to web3 users and builders.

Dune Analytics

Like Flipside Crypto, Dune Analytics aggregates various blockchain data sources that users can access via a SQL query editor. In addition to the editor, Dune Analytics allows you to pass your SQL queries directly into its built-in business intelligence tool to create publicly shareable visualizations and dashboards. Here is an example of a dashboard exploring the NFT marketplace, OpenSea.

Links to some of my favorite blockchain analyses.

This article only touched on blockchain analytics at a high level to get you started, but people in the space are doing some innovative work. Below are three analyses that inspired me recently.

Crime and NFTs: Chainalysis Detects Significant Wash Trading and Some Money Laundering In this Emerging Asset Class: Researchers from Chainalysis use network analysis to identify transactions and users engaging in fraudulent activities.

FF Data Bites: Social Token Tags Network Analysis: ForeFront DAO and Omni Analytics Group collaborate to explore the overlap between tags describing various web3 social groups that use governance tokens.

WORMHOLE - REKT: An overview of how a hacker stole $326MM from Wormhole, a bridge connecting Solana to other blockchains, with a link to the Twitter thread where blockchain detectives reverse-engineered the attack and identified the hacker’s wallet.

Take a look at these fantastic examples, get inspired, and start analyzing!

On the Mark Data

Data engineering consulting, speaking, and content creation. Learn more at onthemarkdata.com.