Bitdash Crawler: A Watchglass-Based Bitcoin Network Crawler

August 10th, 2021

Introducing the Bitdash Crawler, a simple Bitcoin network crawler that leverages Watchglass to interface with the Bitcoin network protocol. I decided to write this tool because other Bitcoin network monitoring tools no longer publish results for TRB, which used to be the only reason I bothered checking those sites in the first place. What I am publishing here is the source for the crawler, which continuously scans the Bitcoin network by recursively sending version and getaddr messages. The results are then stored in a Postgres database. You can browse the results at bitdash.io/nodes or you can download and run this program yourself to create your own copy of the network map.

Crawler Design

A main thread starts up three worker threads and a "heartbeat" thread which logs program state at a set interval (useful for monitoring/debugging).
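
A minimal sketch of that skeleton, assuming standard-library Queue objects are used to hand jobs between the threads (the worker callables are the ones described below; none of this is verbatim from the genesis):

    import threading
    import time
    import Queue  # Python 2.7 stdlib; named "queue" on Python 3

    probe_queue = Queue.Queue()
    result_queue = Queue.Queue()

    def heartbeat(interval=60):
        # Log program state at a set interval, useful for monitoring/debugging.
        while True:
            print("[heartbeat] probe_queue=%d result_queue=%d threads=%d" % (
                probe_queue.qsize(), result_queue.qsize(), threading.active_count()))
            time.sleep(interval)

    def main(workers):
        # 'workers' holds the fill_probe_queue, probe_nodes, and insert_results
        # callables described below; each runs in its own daemon thread.
        for target in list(workers) + [heartbeat]:
            t = threading.Thread(target=target)
            t.daemon = True
            t.start()
        while True:
            time.sleep(1)  # keep the main thread alive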

fill_probe_queue

  • Responsible for refilling the probe_queue whenever it is completely emptied.
  • Fills the queue from the list of known nodes in the DB. If the DB is empty (e.g. on the first run or if manually cleared) it will read from the list of seed nodes in nodes.txt.
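
Roughly, the refill loop might look like the following; db.known_nodes() is a hypothetical stand-in for the actual Postgres query, and the seed file is assumed (not confirmed) to hold one host:port entry per line:

    import time

    def fill_probe_queue(probe_queue, db, seed_file='nodes.txt', poll=5):
        while True:
            if probe_queue.empty():
                nodes = db.known_nodes()  # hypothetical: [(host, port), ...] from the DB
                if not nodes:
                    # First run, or a manually cleared DB: fall back to the seed list.
                    with open(seed_file) as f:
                        pairs = [line.strip().split(':') for line in f if line.strip()]
                    nodes = [(host, int(port)) for host, port in pairs]
                for node in nodes:
                    probe_queue.put(node)
            time.sleep(poll)  # check again once the queue has drained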

probe_nodes

  • Takes jobs (nodes to be probed) from the probe_queue and spins up a probe_node thread for each one (up to the limit set in the config via max_sockets).
  • The probe_node child thread attempts to open a socket connection with the host and send a version message. If successful, it then sends a getaddr message to ask for a list of connected peers.
  • When the requests succeed, fail, or time out, the results are added to the result_queue. Spam nodes are not added to the result_queue.
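
The dispatch side can be sketched as below; probe_node stands for the per-host worker just described, and a BoundedSemaphore enforces the max_sockets cap (the genesis may implement the throttle differently):

    import threading

    def probe_nodes(probe_queue, result_queue, probe_node, max_sockets=50):
        slots = threading.BoundedSemaphore(max_sockets)

        def worker(host, port):
            try:
                probe_node(host, port, result_queue)  # opens the socket, sends version/getaddr
            finally:
                slots.release()  # free the slot whether the probe succeeded or not

        while True:
            host, port = probe_queue.get()  # blocks until a job is available
            slots.acquire()                 # blocks once max_sockets probes are in flight
            t = threading.Thread(target=worker, args=(host, port))
            t.daemon = True
            t.start()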

insert_results

  • Takes jobs (results to be inserted into the DB) from the result_queue and calls the method for inserting them into the database.
  • Also processes any peers included in the result and adds those that do not already exist in the DB to the probe_queue.
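
A sketch of that consumer, with hypothetical db.insert_result() and db.is_known() helpers standing in for the actual Postgres routines:

    def insert_results(result_queue, probe_queue, db):
        queued = set()  # peers already queued this run, to avoid duplicate jobs
        while True:
            result = result_queue.get()  # blocks until a probe finishes
            db.insert_result(result)     # hypothetical: store the probe outcome
            for host, port in result.get('peers', []):
                if (host, port) not in queued and not db.is_known(host, port):
                    queued.add((host, port))
                    probe_queue.put((host, port))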

That's it. No kubernetes cloud vps botnet swarms or whatever the cool kids are doing these days—and it completely ignores IPv6 and "Tor" address spaces. It's just a single Python 2.7 script that'll run fine on a low power ARM processor with 2GB of RAM. On my modest Rockchip with max_sockets set to 800 it completes a full scan (including ~half a million spam peers—I have not implemented blacklisting yet) in around 90 minutes (between 80 and 130 depending on number of spam peers unearthed in a pass). And this is the same server that is also hosting the www as well as the Bitdash logs mirror.

A Note About the 'max_sockets' Throttle

I currently have mine set to 800; this is with ~19k already-discovered nodes (any node which has at least once responded with a version message) in the database. It should probably be set to a lower number on the first run (maybe start with 50), and you should monitor to make sure you aren't DDoS'ing the nodes in the nodes.txt seed list. Recall that a "pass" won't finish until all of the initial nodes and recursively added peers have been scanned once, so if the seed nodes provide good peer lists you may end up with a good chunk of the network after the first pass. If you're getting lots of peers and it feels like it'll take forever to complete the first pass, you can kill the crawler, up the max_sockets, and restart the script.
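
The value comes from the config, which might be read hypothetically like so (only the max_sockets key name is taken from the crawler; the file and section names here are illustrative):

    import ConfigParser  # "configparser" on Python 3

    # Hypothetical config read; e.g. 50 on a first run, 800 once the DB is populated.
    config = ConfigParser.ConfigParser()
    config.read('crawler.conf')
    max_sockets = config.getint('crawler', 'max_sockets')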

Watchglass as a Library

As mentioned, I did not write the pieces that handle the actual Bitcoin protocol communication; for that I used asciilifeform's Watchglass. However, Watchglass in its current form includes an IRC bot and is set up to be run as its own script. For my purposes I wanted only the Bitcoin protocol portion without the IRC bot. Asciilifeform pointed me to wedger.py, which is pretty close, but it also contains a bombard_node method and is set up to be configured via constants in the file.

What I ended up doing is including a "watchglass.py" in the crawler genesis that is made up strictly of the functions and helper methods required for interfacing with the Bitcoin network protocol and is designed to be imported as a library in other Python scripts (i.e. it does not instantiate its own logger and it does not read from a config or set of configurable constants). Where necessary I updated function signatures to allow for passing in parameters which previously would have been set in a config. I also added one method to watchglass.py that I feel is generic enough to be included in the library, unpack_ver_msg, which takes a raw payload and returns the discrete values as a dictionary.
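
To give an idea of the shape of that method, here is a rough version-payload parser; the actual unpack_ver_msg in the genesis may use different field names and handle more of the message (the single-byte varint read assumes a user agent shorter than 253 bytes):

    import struct

    def unpack_ver_msg(payload):
        # version (int32), services (uint64), timestamp (int64), all little-endian
        version, services, timestamp = struct.unpack_from('<iQq', payload, 0)
        offset = 4 + 8 + 8 + 26 + 26 + 8  # skip past addr_recv, addr_from, nonce
        ua_len = ord(payload[offset:offset + 1])  # varint; assumes user agent < 253 bytes
        offset += 1
        user_agent = payload[offset:offset + ua_len]
        offset += ua_len
        start_height = struct.unpack_from('<i', payload, offset)[0]
        return {'version': version,
                'services': services,
                'timestamp': timestamp,
                'user_agent': user_agent,
                'start_height': start_height}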

Including this crawler, there are now three applications that leverage the Watchglass protocol methods. My thinking is that perhaps this protocol library can be moved to its own V-Tree, and the Watchglass IRC bot, the wedger, and this crawler can all be updated to rely on it.

Roadmap: Crawler

  • Implement an exponential backoff for querying nodes that 1) fail to respond and 2) have never responded (see the sketch after this list). This should significantly speed up batch time and reduce the number of max_sockets needed to hit a given target processing interval.
  • Implement a geoIP lookup feature, even though this will require the use of a 3rd-party service.
  • Implement storage of long-term historical snapshots somewhere in the database. Currently it only stores the last 25 results for each known host (configurable), so if one wanted to display, for example, a graph of TRB nodes over time, the current schema would not provide the needed data.
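
The backoff in the first item could be as simple as the following sketch (the base interval and cap are illustrative, not settled values):

    def next_probe_time(last_probe, failures, base=600, cap=86400):
        # Wait base * 2^failures seconds after the last attempt, capped at one
        # day, before retrying a node that keeps failing to respond.
        return last_probe + min(base * (2 ** failures), cap)

    # e.g. a node that has failed 5 probes in a row waits 600 * 2**5 = 19200s (~5.3h)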

Roadmap: Website

  • Responsive CSS so that it works well on different screen sizes.
  • An improved homepage with a collection of key overview metrics.
  • Drill down filters for status, version, user agent, and species.

Patch and Signature

bitdash_crawler_genesis.vpatch
bitdash_crawler_genesis.vpatch.billymg.sig

———