Wednesday, September 28, 2016

Analyze Bitcoin Blockchain Data Using Google BigQuery and Datalab





In this article, we explore what one may learn from the information extracted from Bitcoin Blockchain using Google Cloud Platform BigQuery and Datalab. The first topic is about the anonymity of Bitcoin and the second is comparing the return on Bitcoin to Gold.


The blockchain data used is extracted and processed with heuristic algorithms to identify clustering of transactions that belong to same entities through graph analysis: see detail https://github.com/schloerke/Bitcoin-Transaction-Network-Extraction for more detail.


A blockchain is a distributed ledger,  the data of the entire network will be downloaded by running on a full node. The current data size is about 160G. It will take a few days of running a full mining node before getting blockchain data up to date. The information on Bitcoin pricing (or exchange rate)  is still sparse and many are in different currencies, therefore the sourcing and normalizing data is a challenge.An example source is http://api.bitcoincharts.com/v1/csv/.


Google Cloud Storage and BigQuery is the main ETL tool and storage: it enables you to start exploratory analysis right away, without the need to setup and maintain an elastic high performance computing cluster and storage. It likes having a robotic assistant on rent who could prepare food ingredients fast when one is experimenting with them to develop new receipts.  


ETL
Raw data is often received in various of formats and can lead to extended run times for ETL with large datasets. For certain transformations, Biq Query provides a simple and fast path for completing the tasks. For example, the query shown below transforms the "daytime format from integer -- 20090109035439 -- to TIMESTAMP format.


SELECT txId, sender, receiver, value, outAddrId, outIndex, blockHt,
TIMESTAMP(
CONCAT(
CAST (INTEGER(time/10000000000) AS STRING),'-',
CAST (INTEGER((time%10000000000)/100000000)AS STRING),'-',
CAST (INTEGER((time%100000000)/1000000) AS STRING),' ',
CAST (INTEGER((time%1000000)/10000) AS STRING),':',
CAST (INTEGER((time%10000)/100) AS STRING),':',
CAST (INTEGER(time%100) AS STRING)
)) AS txnTS
FROM blockchain.user_edges_raw


In this example, the transformation of 5.83G of data is processed in 136 seconds.


Exploring Data using Datalab
(Texts in red are codes, and texts in blue are output in Datalab(Jupyter) notebook.)
  • Number of transactions over time:
Finding trend, presumably, is the first place to start when exploring new datasets. The example here is to look at the number of transactions done over time. It is not certain whether the timestamp information is adjusted for time zones.


import datalab.bigquery as bq
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import num2date
from matplotlib.dates import date2num
from matplotlib.finance import candlestick_ohlc
import json
from pprint import pprint
from PyQt4 import QtGui, QtCore
import urllib
import requests
import sys
import gzip
import urllib
import datetime
import numpy as np
import pandas as pd
import time
import re
%%sql --module txn
select count(txid) as no_of_txn, txnTS from blockchain.userEdges
group by txnTS
df = bq.Query(txn).to_dataframe()
df.plot(x='txnTS',y='no_of_txn')
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a9305fa10>


%%sql --module btc_sinks
SELECT a.outAddrId, b.address, sum(a.value) as btc_amount
FROM (
SELECT
 outAddrId, value
FROM [plasma-centaur-859:blockchain.userEdges]
) a
JOIN EACH (
SELECT
 id, address
FROM
 [plasma-centaur-859:blockchain.address]
) b
ON a.outAddrId = b.id
WHERE a.outAddrId = b.id
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 500
topSinks=bq.Query(btc_sinks).to_dataframe()
topSinks

outAddrId
address
btc_amount
0
2536935
13hKjDXD5WFZreZzfAGFpgajEdMotuxRvk
2.546664e+07
1
2298461
17URSNmoNqAunWUa3fi6rx22s8BJkAEkPT
5.706508e+06
2
7996957
19Hx7xow6RKTcqney12KYFsZzKpRib8iG1
5.005462e+06
3
17322005
17zRM14vC2z4QtSnANdnthY9LYovnfV84Z
3.403557e+06
4
12557222
18T1hzjughXdG4ucUh8ACDmCYK3baJpCEf
1.771494e+06
5
11204757
17wWMcge7fWV9r6JCLRc4iVmPnW6yGhNUc
1.573960e+06
6
19314133
1NQvPqe2op85q9X9hgSY5GxJERmqoSBmCf
1.569076e+06
7
3524361
1LWBSqy4s7yZrKU6GfFzx7YmsHcXfuksUZ
1.016878e+06
8
6835733
16YgsR6EGWtx3N2X3c6DvTesrKab9qriu6
9.550461e+05
9
7033638
14486XytdA6yKh4cGUGzrmxccaSznXeAAD
9.228888e+05
10
3451099
13SH2dehm6oxFav9dMc7i1NpMsgTVDWbpj
9.014718e+05
11
3517603
1idWZBys5AVK8PBZL2JHwFVe7vnFCe4Qk
8.582446e+05
12
13013852
17R1QnCThtmjkFTm7Xbrq14TyhA2thHLSP
8.181718e+05
13
21419578
1CBhbofpCiHeYekqqvtsD4gN7VRXyy1kV5
7.964936e+05
14
2817770
18om4X8rebEEGYtigU458Z9v8cFajdXshh
6.133265e+05
15
3524210
1PFbC2mZTRfeqRFkLRVeWjjHiugmte5Qhi
6.120741e+05
...
...
...
...
496
403757
17wwDDkFSsyygHJSAwAVbG1RRVrF3xS293
7.900000e+04
497
32647318
12H9E9DpRnYDxRRMRmd8nW9mnZNjd8s2yK
7.869300e+04
498
3025901
19FHRfZAiroNZEqBJjsFemr3NvKdEzJrQ4
7.863082e+04
499
65395
19DRFyzVhG7CwfAQgn2AcpBtx1vTaTuSiL
7.829894e+04
500 rows × 3 columns


  • Each address can be examined for unspent coins.
import json
from blockcypher import get_address_overview
import time
def getAddressInfo(smallDf, token):
 for index, row in smallDf.iterrows():
   data = (get_address_overview(row['address'],'btc',token))
   final_n_tx = data['final_n_tx']
   n_tx = data['n_tx']
   unconfirmed_balance = data['unconfirmed_balance']
   final_balance = data['final_balance']
   balance = data['balance']
   total_sent = data['total_sent']
   address = data['address']
   total_received = data['total_received']
   print(address, total_received, balance, final_n_tx, total_sent)
   temp = {
     "address": address,
     "final_n_tx": final_n_tx,
     "n_tx": n_tx,
     "unconfirmed_balance": unconfirmed_balance,
     "final_balance": final_balance,
     "balance": balance,
     "total_sent": total_sent,
     "total_received": total_received,
   }
   addresses.append(temp)
   #introduce pause
   time.sleep (100.0 / 1000.0);
   
Most API services have rate limit, so short pause and sharding data are required to collect fuller set of data. The API token ‘62aa1f7dbc694b3e939fa8a2b9762f9f’ used below is obtained from Blockcypher (https://www.blockcypher.com).


addresses=[]
df_new1, df_new2, df_new3 = topSinks[:199], topSinks[200:399], topSinks[400:499]

getAddressInfo(df_new1,'62aa1f7dbc694b3e939fa8a2b9762f9f')
(u'13hKjDXD5WFZreZzfAGFpgajEdMotuxRvk', 23231010, 0, 2, 23231010)
(u'17URSNmoNqAunWUa3fi6rx22s8BJkAEkPT', 1147262900, 0, 2, 1147262900)
(u'19Hx7xow6RKTcqney12KYFsZzKpRib8iG1', 35552011, 0, 2, 35552011)
(u'17zRM14vC2z4QtSnANdnthY9LYovnfV84Z', 199999950000, 0, 2, 199999950000)
(u'18T1hzjughXdG4ucUh8ACDmCYK3baJpCEf', 98000004, 0, 2, 98000004)
(u'17wWMcge7fWV9r6JCLRc4iVmPnW6yGhNUc', 1064361310, 0, 2, 1064361310)
(u'1NQvPqe2op85q9X9hgSY5GxJERmqoSBmCf', 81000000, 0, 3, 81000000)
(u'1LWBSqy4s7yZrKU6GfFzx7YmsHcXfuksUZ', 214460712207, 0, 2, 214460712207)
(u'16YgsR6EGWtx3N2X3c6DvTesrKab9qriu6', 1390000, 0, 2, 1390000)
(u'14486XytdA6yKh4cGUGzrmxccaSznXeAAD', 3922905117, 0, 2, 3922905117)
(u'13SH2dehm6oxFav9dMc7i1NpMsgTVDWbpj', 1800000, 0, 2, 1800000)
(u'1idWZBys5AVK8PBZL2JHwFVe7vnFCe4Qk', 1195000, 0, 2, 1195000)
(u'17R1QnCThtmjkFTm7Xbrq14TyhA2thHLSP', 283524240, 0, 2, 283524240)
(u'1CBhbofpCiHeYekqqvtsD4gN7VRXyy1kV5', 50000000, 0, 2, 50000000)
(u'18om4X8rebEEGYtigU458Z9v8cFajdXshh', 78605090, 0, 2, 78605090)
(u'1PFbC2mZTRfeqRFkLRVeWjjHiugmte5Qhi', 12760258243, 0, 162, 12760258243)
(u'1FutgpSdhCwYio5ZrSZGh68mQY9BNYhT9P', 13152165610, 0, 2, 13152165610)
(u'1CRVQTBb4gVnJjhHjKDqzNZas3exqBVNeQ', 87800000, 0, 2, 87800000)
(u'1HjDauL2kth6KJUz5vX198Nvp1xN1hgYRb', 39184702881299, 40000, 396890, 39184702841299)
(u'13HFqPr9Ceh2aBvcjxNdUycHuFG7PReGH4', 38788846237971, 43000, 396569, 38788846194971)
(u'1NWVkQJGNagiKXDPWs364LL71aNRuv1usU', 2055378000, 0, 2, 2055378000)
(u'1NDpZ2wyFekVezssSXv2tmQgmxcoHMUJ7u', 39113236432903, 83001, 397702, 39113236349902)
(u'1Ns6m3RagT5pRNYwxfrYvXS9neUWMYLL7u', 11993137, 1, 3, 11993136)
(u'15fXdTyFL1p53qQ8NkrjBqPUbPWvWmZ3G9', 39357308117256, 28000, 397414, 39357308089256)
(u'17i5LFxeStN9eVzZCTjpmrzk1NNb9c23fG', 2057872000, 0, 2, 2057872000)
(u'1JYwZ8RBRbbgyT5P6QSFVmAdciqcUB7q6Z', 1425000000000, 0, 2, 1425000000000)
(u'1xP3ouLDbuHqgMfg1xyRFfT4o9UJm6RdP', 396670000, 0, 2, 396670000)
(u'15qXBcz51afUKCyZJ8EmdGrrswQnHXUCEV', 36775761, 0, 2, 36775761)
(u'16JBaxknhiMXSWEbrs3LZ8nwLGJabqnoRH', 1568541321900, 0, 2, 1568541321900)
(u'1H591vi5RtA9UXapERSSEmFz8enFN8EqBb', 3955213, 3955213, 22, 0)
(u'1Dfq5VJBzW5dwFJDq4Tv9iinA7pcvPwWps', 1302347, 0, 2, 1302347)
(u'1ByBuY8hcYS5u2GbMhWNWQjf5QvPCFy9NJ', 5013984881, 0, 2, 5013984881)
(u'1BjM7XxmPvgEvoTrJbSJJ7BxUAkQTXSf8f', 353000000, 0, 6, 353000000)
(u'19L2wFKxxWBVHWLt8phYvtQB7tXZ2xvn8s', 3965500000000, 0, 2, 3965500000000)
(u'1HdAYY1eV6KtTDmBFefHGtHa4hEuU26G6k', 5000000000000, 0, 2, 5000000000000)
(u'1Pz8cYwWdgebjW46rBXpVBzy3E3B4uBrRj', 4160100000, 0, 2, 4160100000)
(u'13xstehihBmJHtX6vFNhD4n4ujW5vx5beo', 11415121, 0, 2, 11415121)
(u'1BnspJcLbQ1a2YAWWw1mFkTrudwK7ewLVh', 27315440000, 0, 2, 27315440000)
(u'16PJUzjvEApvzR1g6bM9Vod4yYgfGwjZTY', 14193600464, 0, 142, 14193600464)
(u'12HG1Hkac7kcsSy8vNKVecB8kePVxdpqBK', 10150000, 0, 2, 10150000)
(u'158LPLdQbAnqAsKrD8rUPZuGGFaMgw6jyd', 189900000, 0, 2, 189900000)
(u'158jzBkJFZJz8soep4RR1WALnVgvMD1STr', 5000000000000, 0, 2, 5000000000000)
(u'14yX4kRDME61BbM7DMnuv1PKUtk8K65J4y', 5000000000000, 0, 2, 5000000000000)
(u'1KF8gfEBEQ94tYXQNhn7Am44MGukXGLrxP', 126049100, 0, 2, 126049100)
(u'1CSZZFD7yLr24f3MUYaZWACVHBZN3vfa3M', 9991387866, 0, 22, 9991387866)
(u'14mhTBrgcTKdABnNv1oTuzuUmWKHjadq4M', 1000000, 0, 2, 1000000)
(u'12UKpwJppJ5hxYAfs1DTg1astenuFcHojz', 4000000, 4000000, 1, 0)
(u'17HDmzdYRj2bjL9wu55CRhHUiYwvKVf2p6', 190000, 0, 2, 190000)
(u'1Ev22onN7MqL3L2CLLTaY6YjYYSjXGpoCJ', 25580508185, 0, 2, 25580508185)


  • Check if top receiving addresses match known entities within certain categories. As it is mentioned earlier, the identities and entities of addresses could be found on the Internet. The function as follows is an example of retrieving category information of certain addresses.


from bs4 import BeautifulSoup

known_addresses = []

def lets_get_scraping(url, id):
   r = requests.get(url)
   soup = BeautifulSoup(r.text, 'html.parser')

   for tr in soup.tbody.find_all('tr'):
       for strong_tag in tr.find_all('strong'):
           if id == 1:
               temp = {strong_tag.get_text(): 1}
               known_addresses.append(temp)
           elif id == 2:
               temp = {strong_tag.get_text(): 2}
               known_addresses.append(temp)
           elif id == 3:
               temp = {strong_tag.get_text(): 3}
               known_addresses.append(temp)
           elif id == 4:
               temp = {strong_tag.get_text(): 4}
               known_addresses.append(temp)
   return known_addresses


for i in range(17):
   gambling = 'http://bitcoinwhoswho.com/search/index/index/keyword/gambling/page/{0}'.format(i)
   lets_get_scraping(gambling, id=1)

print('Completed: Gambling')

for i in range(2):
   charity = 'http://bitcoinwhoswho.com/search/index/index/keyword/charity/page/{0}'.format(i)
   lets_get_scraping(charity, id=2)

for i in range(34):
   charity = 'http://bitcoinwhoswho.com/search/index/index/keyword/donate/page/{0}'.format(i)
   lets_get_scraping(charity, id=2)

print('Completed: Charity')

for i in range(17):
   finance = 'http://bitcoinwhoswho.com/search/index/index/keyword/finance/page/{0}'.format(i)
   lets_get_scraping(finance, id=3)

print('Completed: Finance')

for i in range(162):
   exchange = 'http://bitcoinwhoswho.com/search/index/index/keyword/exchange/page/{0}'.format(i)
   lets_get_scraping(exchange, id=3)


Completed: Gambling
Completed: Charity
Completed: Finance
%%sql --module address_with_most_senders
SELECT b.address as addr, count (DISTINCT sender) as sender_counts
FROM
(SELECT
 sender, outAddrId
FROM [plasma-centaur-859:blockchain.userEdges]
) a
JOIN EACH (
SELECT
 id, address
FROM
 [plasma-centaur-859:blockchain.address]
) b
ON a.outAddrId = b.id
WHERE a.outAddrId = b.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20000
most_senderDf=bq.Query(address_with_most_senders).to_dataframe()
Gambling = []
Charity = []
Finance = []
for index, row in most_senderDf.iterrows():
 for known in known_addresses:
      k = known.keys()[0]
      if k == row['addr']:
         if known[k] == 1:
             Gambling.append(row['addr'])
         if known[k] == 2:
             Charity.append(row['addr'])
         if known[k] == 3:
             Finance.append(row['addr'])
print 'number of Gambling -> ' + str(len(Gambling))
print 'number of Charity -> ' + str(len(Charity))
print 'number of Finance -> ' + str(len(Finance))
number of Gambling -> 4
number of Charity -> 0
number of Finance -> 0
4 addresses match gambling sites. And their address:
  • 17mfXdqBGHEg9xacLDNnZSHNvkDS5BBDmn
  • 17mfXdqBGHEg9xacLDNnZSHNvkDS5BBDmn
  • 1Eb7PfCxrn5YZr2XVoYanMmkjrHBTXDcBT
  • 1Eb7PfCxrn5YZr2XVoYanMmkjrHBTXDcBT


  • Rudimentary comparison of Gold and Bitcoin investment performance. Bitcoin and other crypto currencies such as Ether from Ethereum are now considered as an asset class. There is not much good historical data available, so this is mostly exploratory.


#Time series analysis USD/BTC and USD/Gold
def remove_bitcoin_date_duplicates(frame):
   """
   Remove bitcoin date duplicates by using (weighted) aggregation.
   """
   # Aggregate duplicate dates
   frame['price'] = frame['price'] * frame['amount']
   frame = frame.groupby(level=0).sum()
   frame['price'] = np.round(frame['price'] / frame['amount'], 5)
return frame


def convert_to_ohlc(frame, freq='1D'):
   """
   Compute bitcoin OHLC data frame with the given frequency.
   @param frame: raw bitcoin price history.
   @param freq: target OHLC frequency.
   """
   ohlc = frame['price'].resample(freq).ohlc()
   close = frame['price'].resample(freq).last().ffill()
   for column in ['open', 'high', 'low', 'close']:
       ohlc[column] = np.where(np.isnan(ohlc[column]), close, ohlc[column])
   ohlc['amount'] = frame['amount'].resample(freq).last()
   ohlc['amount'].fillna(0.0, inplace=True)
   return ohlc
#load csv from google cloud storage btc prices
%storage read --object gs://testdata-bg-ml/blockchain/mtgoxUSD.csv --variable btcusd
from StringIO import StringIO
btcdf=pd.read_csv(StringIO(btcusd), header=None)
btcdf.columns = ['dt', 'price', 'amount']
btcdf['dt'] = pd.to_datetime(btcdf['dt'], unit='s')
btcdf.set_index(btcdf['dt'], inplace=True)
btcdf.sort_index()
btcdf=btcdf[['price', 'amount']]
remove_bitcoin_date_duplicates(btcdf)
btc_ohlc=convert_to_ohlc(btcdf, freq='1D')
btc_ohlc[len(btc_ohlc)-20:len(btc_ohlc)-1]

open
high
low
close
amount
dt





2014-02-06
9.047000
165039.858852
0.000008
207.244967
0.250000
2014-02-07
41.350000
159797.250000
0.000007
6.956182
0.010000
2014-02-08
6.852200
132837.779797
0.000007
42.847200
0.066000
2014-02-09
3952.879440
97880.026541
0.000006
6.594978
0.010000
2014-02-10
43.333572
174000.000000
0.000005
7.525331
0.012922
2014-02-11
87.968410
88007.660000
0.000006
308.289536
0.532672
2014-02-12
102.511593
253297.287595
0.000005
21.240000
0.040000
2014-02-13
5.400370
126467.532664
0.000005
1804.680000
4.000000
2014-02-14
23.853474
171746.092896
0.000006
851.192311
1.991000
2014-02-15
3.847680
105175.543367
0.000007
89.367218
0.240882
2014-02-16
3.661012
216377.515944
0.000005
7.492550
0.025000
2014-02-17
74.591745
67943.258101
0.000006
272.500000
1.000000
2014-02-18
2394.482759
57212.401320
0.000005
295.634055
1.006243
2014-02-19
2.936000
36397.654926
0.000006
1308.491450
5.000000
2014-02-20
14.273712
58500.000000
0.000006
20.074046
0.179719
2014-02-21
89.295960
60000.000000
0.000005
1.114000
0.010000
2014-02-22
1011.679765
56805.147540
0.000005
2.782257
0.010888
2014-02-23
110.700008
35250.000000
0.000006
196.181870
0.632845
2014-02-24
3.149990
126521.979120
0.000006
1.872930
0.010774


btc_ohlc['close'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fb44ba5a950>
%%sql --module goldusd
SELECT DATE(date) AS dt, price as close
FROM blockchain.gold2usdDaily
WHERE DATE(date) >= DATE('2010-07-17') AND DATE(date) <= DATE('2014-02-25')
ORDER BY dt
golddf=bq.Query(goldusd).to_dataframe().set_index('dt')
golddf.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fb4bd3e93d0>
  • Gold and Bitcoin prices are normalized in scale.


data = golddf.join(btc_ohlc['close'],how='left', lsuffix='_gold', rsuffix='_btc')
data['gold_scaled'] = data['close_gold']/max(data['close_gold'])
data['btc_scaled'] = data['close_btc']/max(data['close_btc'])
data_scaled=pd.concat([data['gold_scaled'],data['btc_scaled']],axis=1)
data_scaled.plot(figsize=(20,15))
<matplotlib.axes._subplots.AxesSubplot at 0x7fb4bcf2e510>


  • Use log to compare the daily returns of Gold and Bitcoin


log_return = pd.DataFrame()
log_return['gold_return'] = np.log(data['close_gold']/data['close_gold'].shift())
log_return['btc_return'] = np.log(data['close_btc']/data['close_btc'].shift())
log_return.describe()

gold_return
btc_return
count
941.000000
941.000000
mean
0.000133
0.003735
std
0.011928
3.511937
min
-0.095962
-17.665482
max
0.048387
18.281800


_ = pd.concat([log_return['gold_return'], log_return['btc_return']], axis=1).plot(figsize=(20,15), color='darkgreen')
from pandas.tools.plotting import scatter_matrix
_ = scatter_matrix (log_return, figsize=(5,5), diagonal='kde')


log_return.corr(method='spearman',min_periods=1)

gold_return
btc_return
gold_return
1.000000
-0.068212
btc_return
-0.068212
1.000000





The scattered plot and correlation matrix indicates that Bitcoin and Gold prices are not correlated much. Comparing with Gold, Bitcoin has much higher volatility, which can make it hard to use for pricing and trading goods and services. From the investment perspective, it could be a good option for diversification.


Conclusion
The simple experimentation tell two stories as follows with data:
  1. Even Bitcoin transactions are pseudonymous, with enough computing and analytical power, and related information from the web; the identities of senders and receivers, and flows of money could still be traced with good probabilities. The concern about its use in money laundering is much overblown; however, privacy is a valid concern. But people trade privacy for utility or convenience all the time: e.g. the use of social media, credit cards, loyalty programs, etc.;
  2. Bitcoin offers higher return on investments than Gold between 2010 and 2014 but with much higher volatility.

More stories can be explored with additional experiments and inclusions of richer datasets such as social graphs, economic indices, world bank, and Gdelt. More and more powerful and easy to use tools will allow people imagine and experiment more and therefore expediting the creation of new knowledge. One limitation with BigQuery is that large result set request from Datalab will fail, while there is no problem when loading from Cloud Storage. It is hoped that there will be no such limitation in the near future and integration to other tools such as SPARKS will also be available.

Juypter notebook: https://github.com/pblin/bernardNotebook/blob/master/BTCTX-Copy3.ipynb