Bitdash Crawler: A Watchglass-Based Bitcoin Network Crawler

August 10th, 2021

Introducing the Bitdash Crawler, a simple Bitcoin network crawler that leverages Watchglass to interface with the Bitcoin network protocol. I decided to write this tool because other Bitcoin network monitoring tools no longer publish results for TRB, which used to be the only reason I bothered checking those sites in the first place. What I am publishing here is the source for the crawler, which continuously scans the Bitcoin network by recursively sending version and getaddr messages. The results are then stored in a Postgres database. You can browse the results at bitdash.io/nodes or you can download and run this program yourself to create your own copy of the network map.

Crawler Design

A main thread starts up three worker threads and a "heartbeat" thread which logs program state at a set interval (useful for monitoring/debugging).

fill_probe_queue
  • Responsible for refilling the probe_queue whenever it is completely emptied.
  • Fills the queue from the list of known nodes in the DB. If the DB is empty (e.g. on the first run or if manually cleared) it will read from the list of seed nodes in nodes.txt.
probe_nodes
  • Takes jobs (nodes to be probed) from the probe_queue and spins up a probe_node thread for each one (up to the limit set in the config via max_sockets).
  • The probe_node child thread attempts to open a socket connection with the host and send a version message. If successful it then sends a getaddr message to ask for a list of connected peers.
  • When the requests succeed, fail, or time out, the results are added to the result_queue. Spam nodes are not added to the result_queue.
insert_results
  • Takes jobs (results to be inserted into the DB) from the result_queue and calls the method for inserting them into the database.
  • Also processes any peers included in the result and adds those that do not exist in the DB to the probe_queue.
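
The flow between the two queues can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawler's actual code: FAKE_NETWORK, fake_probe, and crawl_one_pass are hypothetical names, a dict stands in for the Postgres table, the network exchange is faked with an in-memory map, and a single prober thread is used instead of one probe_node thread per host.

```python
import threading
try:
    import queue                 # Python 3
except ImportError:
    import Queue as queue        # Python 2.7

# Hypothetical stand-in for the network: host -> peers it reports via getaddr.
FAKE_NETWORK = {
    "10.0.0.1": ["10.0.0.2", "10.0.0.3"],
    "10.0.0.2": ["10.0.0.1", "10.0.0.4"],
    "10.0.0.3": [],
}

def fake_probe(host):
    """Stand-in for the version + getaddr exchange; unknown hosts 'time out'."""
    alive = host in FAKE_NETWORK
    return {"host": host, "alive": alive, "peers": FAKE_NETWORK.get(host, [])}

def crawl_one_pass(seeds, db=None, probe=fake_probe):
    db = {} if db is None else db          # stand-in for the Postgres table
    probe_queue, result_queue = queue.Queue(), queue.Queue()

    # fill_probe_queue: known nodes from the DB, else the seed list.
    initial = list(db) or list(seeds)
    queued = set(initial)
    for host in initial:
        probe_queue.put(host)

    def probe_nodes():
        while True:
            host = probe_queue.get()
            if host is None:               # sentinel: the pass is over
                return
            result_queue.put(probe(host))

    def insert_results():
        while True:
            result = result_queue.get()
            if result is None:             # sentinel: the pass is over
                return
            db[result["host"]] = result["alive"]
            for peer in result["peers"]:
                if peer not in queued:     # only peers not already known
                    queued.add(peer)
                    probe_queue.put(peer)
            probe_queue.task_done()        # host handled, its peers enqueued

    workers = [threading.Thread(target=probe_nodes),
               threading.Thread(target=insert_results)]
    for w in workers:
        w.start()
    probe_queue.join()                     # every queued host has been processed
    probe_queue.put(None)
    result_queue.put(None)
    for w in workers:
        w.join()
    return db
```

Calling crawl_one_pass(["10.0.0.1"]) walks outward from the single seed and returns once every discovered host has been probed once, which mirrors the notion of a "pass" used in this post.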

That's it. No kubernetes cloud vps botnet swarms or whatever the cool kids are doing these days, and it completely ignores IPv6 and "Tor" address spaces. It's just a single Python 2.7 script that'll run fine on a low-power ARM processor with 2GB of RAM. On my modest Rockchip with max_sockets set to 800, a full scan takes around 90 minutes (anywhere from 80 to 130, depending on the number of spam peers unearthed in a pass). That figure includes roughly half a million spam peers, since I have not implemented blacklisting yet. And this is the same server that also hosts the www and the Bitdash logs mirror.

A Note About the 'max_sockets' Throttle

I currently have mine set to 800, with ~19k already discovered nodes (any node that has responded with a version message at least once) in the database. On the first run it should probably be set lower (maybe start with 50), and you should monitor the crawl to make sure you aren't DDoS'ing the nodes in the nodes.txt seed list. Recall that a "pass" won't finish until all of the initial nodes and recursively added peers have been scanned once, so if the seed nodes provide good peer lists you may end up with a good chunk of the network after the first pass. If you're getting lots of peers and it feels like it'll take forever to complete the first pass, you can kill the crawler, raise max_sockets, and restart the script.
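
One common way to enforce a cap like max_sockets is a semaphore that the spawning loop acquires before starting each probe thread. The sketch below assumes that approach; the crawler itself may implement the limit differently, and probe_all, measure_peak, and dummy_probe are hypothetical names used only for illustration.

```python
import threading
import time

def probe_all(hosts, probe, max_sockets):
    """Probe every host, never running more than max_sockets probes at once."""
    slots = threading.BoundedSemaphore(max_sockets)
    results, results_lock, threads = [], threading.Lock(), []

    def worker(host):
        try:
            outcome = probe(host)
            with results_lock:
                results.append((host, outcome))
        finally:
            slots.release()          # free the slot even if the probe raised

    for host in hosts:
        slots.acquire()              # blocks while max_sockets probes are in flight
        t = threading.Thread(target=worker, args=(host,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results

def measure_peak(num_hosts, max_sockets):
    """Run probe_all against a dummy probe and report the peak concurrency seen."""
    state = {"active": 0, "peak": 0}
    lock = threading.Lock()

    def dummy_probe(host):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)             # pretend to wait on a socket
        with lock:
            state["active"] -= 1
        return True

    hosts = ["host%d" % i for i in range(num_hosts)]
    results = probe_all(hosts, dummy_probe, max_sockets)
    return len(results), state["peak"]
```

With this shape, raising or lowering the cap only changes how long the spawning loop waits for a free slot; the per-host probe logic stays the same.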

Watchglass as a Library

As mentioned, I did not write the pieces that handle the actual Bitcoin protocol communication; for that I used asciilifeform's Watchglass. However, Watchglass in its current form includes an IRC bot and is set up to be run as its own script. For my purposes I wanted only the Bitcoin protocol portion, without the IRC bot. Asciilifeform pointed me to wedger.py, which is pretty close, but it also contains a bombard_node method and is set up to be configured via constants in the file.
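
For context, much of that protocol layer comes down to framing messages according to the Bitcoin wire format: a 24-byte header (network magic, command name, payload length, checksum) followed by the payload. The sketch below is based on the public protocol description and is not taken from Watchglass; frame_message is a hypothetical helper name.

```python
import hashlib
import struct

MAINNET_MAGIC = b"\xf9\xbe\xb4\xd9"   # mainnet network magic as sent on the wire

def frame_message(command, payload=b""):
    """Wrap a payload in the 24-byte Bitcoin message header."""
    # Checksum is the first 4 bytes of a double SHA-256 of the payload.
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    # "12s" null-pads the command name to 12 bytes; length is little-endian.
    header = struct.pack("<4s12sI4s", MAINNET_MAGIC, command, len(payload), checksum)
    return header + payload

# getaddr carries no payload, so the whole message is just the header.
getaddr = frame_message(b"getaddr")
```

A version message is framed the same way, with its payload passed as the second argument.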

What I ended up doing is including a "watchglass.py" in the crawler genesis that is made up strictly of the functions and helper methods required for interfacing with the Bitcoin network protocol and is designed to be imported as a library in other Python scripts (i.e. it does not instantiate its own logger and it does not read from a config or set of configurable constants). Where necessary I updated function signatures to allow for passing in parameters which previously would have been set in a config. I also added one method to watchglass.py that I feel is generic enough to be included in the library, unpack_ver_msg, which takes a raw payload and returns the discrete values as a dictionary.
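
To illustrate the kind of work a function like unpack_ver_msg does, here is a minimal sketch of unpacking a version payload with the struct module. It follows the publicly documented field layout and assumes a peer speaking protocol version 106 or later (so the addr_from, nonce, and user agent fields are present). The function names and dictionary keys are assumptions for illustration and may not match the ones in watchglass.py.

```python
import struct

def read_varint(buf, offset):
    """Read a Bitcoin variable-length integer; return (value, next_offset)."""
    first = struct.unpack_from("<B", buf, offset)[0]
    if first < 0xfd:
        return first, offset + 1
    fmt, size = {0xfd: ("<H", 2), 0xfe: ("<I", 4), 0xff: ("<Q", 8)}[first]
    return struct.unpack_from(fmt, buf, offset + 1)[0], offset + 1 + size

def sketch_unpack_version(payload):
    """Return selected version-message fields as a dictionary."""
    # Bytes 0..20: protocol version (int32), services (uint64), timestamp (int64).
    version, services, timestamp = struct.unpack_from("<iQq", payload, 0)
    # Bytes 20..72: addr_recv and addr_from, 26 bytes each (skipped here).
    nonce = struct.unpack_from("<Q", payload, 72)[0]
    # Byte 80 onward: user agent as a length-prefixed string, then start height.
    ua_len, offset = read_varint(payload, 80)
    user_agent = payload[offset:offset + ua_len].decode("ascii", "replace")
    offset += ua_len
    start_height = struct.unpack_from("<i", payload, offset)[0]
    return {
        "version": version,
        "services": services,
        "timestamp": timestamp,
        "nonce": nonce,
        "user_agent": user_agent,
        "start_height": start_height,
    }
```

Returning a dictionary keeps the caller independent of byte offsets, which is convenient when the values are headed for database columns.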

Including this crawler, there are now three applications that leverage the Watchglass protocol methods. My thinking is that perhaps the library can be moved to its own V-Tree, and the Watchglass IRC bot, the wedger, and this crawler can be updated to rely on it.

Roadmap: Crawler

  • Implement an exponential backoff for querying nodes that 1) fail to respond and 2) have never responded. This should significantly shorten each pass and lower the max_sockets setting needed to hit a given target processing interval.
  • Implement a geoIP-lookup feature even though this will require the use of a 3rd party service.
  • Implement storage of long term historical snapshots somewhere in the database. Currently it only stores the last 25 results for each known host (configurable) but if one wanted to display, for example, a graph of TRB nodes over time they would not be able to get the data they need from the current schema.
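
As a rough idea of what the backoff in the first roadmap item could look like, the sketch below doubles the wait before re-probing a node for each consecutive failure, up to a cap. Nothing here is implemented in the crawler yet; next_probe_delay is a hypothetical name, and the 90-minute base and one-week cap are illustrative assumptions.

```python
def next_probe_delay(consecutive_failures, base_seconds=90 * 60,
                     max_seconds=7 * 24 * 3600):
    """Seconds to wait before probing a node again.

    Doubles the base delay for each consecutive failure, capped at max_seconds.
    Both defaults are illustrative values, not crawler settings.
    """
    if consecutive_failures <= 0:
        return base_seconds
    return min(base_seconds * 2 ** consecutive_failures, max_seconds)

# Example schedule for a node that keeps failing.
schedule = [next_probe_delay(n) for n in range(5)]
```

A node that has never responded could simply start with a higher failure count, so that it is retried less often than a node that was recently alive.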

Roadmap: Website

  • Responsive CSS so that it works well on different screen sizes.
  • An improved homepage with a collection of key overview metrics.
  • Drill down filters for status, version, user agent, and species.

Patch and Signature

bitdash_crawler_genesis.vpatch
bitdash_crawler_genesis.vpatch.billymg.sig


3 Comments

  1. Hey, pretty slick, congrats.

    I was reading the site and noticed that while the trb page is pretty much a copy of the foundation's how to, it doesn't capture the whole picture as mod6 had either already resigned or was on the way out when the code below was released.

    Ada isn't a hard requirement for keccak V since JFW's cleaned up v.pl and made it work with keksum.

    Also, the whole buildroot dance isn't a hard requirement either since JFW's system compiler patch. Granted, you'd still want a musl-static environment, which Gales Linux is. I've only tested/used the system compiler patch there.

  2. billymg says:

    Hey, thanks! For whatever reason your comment got eaten by the spam trap and I didn't see it until now.

    Yup, the TRB page was just a copy/paste of mod6's HTML, meant only to serve as a placeholder (though I guess it's been serving as that for quite a while now). My plan is still to produce an updated version with more detail for those completely new to TRB and V. I appreciate the notes on the cleaned up v.pl and TRB system-based compiler. I will test those out and include them in the guide.

    I actually did test out JFW's new TRB build system when I got stuck during the rotor install on my box. Unfortunately it didn't work for me at the time (and I forget where it barfed), but if you say it requires a musl environment then that was most likely it. I'm going to work on another box soon though (musl this time) and will try it again on that one.

    Good to hear from you. Not sure if you've seen any of the talk around Pest but there's now a proto-network running and getting some use. It's a different atmosphere from the old days but most seem to be staying productive and putting out interesting projects nonetheless. Let me know if you want to peer with my station to try it out sometime.

