The Bitdash Crawler has been running steadily since August of last year. The first release was minimal: only the basic crawling functionality, collecting only the data returned directly by the probed nodes. This patch adds further data collection (geolocation and time series), as well as a new IRC interface for basic status reports.
When I first looked into adding geolocation I was reluctant, because I thought the only way was to connect to a third-party service via API keys. I didn't want the crawler to rely on any third-party service, as those come with the risk that they will change out from under you at any moment and without notice. Thankfully a user in #asciilifeform pointed me to MaxMind's excellent GeoLite2 geolocation database. I was almost surprised at how simple it was and that there was no catch. Yes, you have to provide an email and create an account with them, but after that they just give you their geolocation database[1] in CSV format, which you can take wherever you like. They even provide clear documentation for creating the schema and importing the CSV data. Overall a very nice offering, and I'd recommend it to anyone who needs to add geolocation to their project.
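To give a feel for what a lookup against the CSV data involves, here is a minimal Python sketch. It is not the crawler's code, which imports the CSVs into a SQL database instead. It assumes the column names `network`, `geoname_id`, and `country_name` match the GeoLite2 Country CSV layout (check them against the files you download). The sample rows are made up and use reserved documentation address ranges, and the function names are hypothetical.

```python
import csv
import io
import ipaddress

# Illustrative rows only: the networks are reserved documentation ranges
# (RFC 5737) and the IDs and country names are invented for this sketch.
BLOCKS_CSV = """network,geoname_id,registered_country_geoname_id
192.0.2.0/24,1,1
198.51.100.0/24,2,2
"""

LOCATIONS_CSV = """geoname_id,locale_code,country_iso_code,country_name
1,en,AA,Exampleland
2,en,BB,Samplestan
"""


def load_locations(fileobj):
    """Map geoname_id -> country_name from a locations CSV."""
    return {row["geoname_id"]: row["country_name"]
            for row in csv.DictReader(fileobj)}


def load_blocks(fileobj):
    """Return a list of (network, geoname_id) pairs from a blocks CSV."""
    return [(ipaddress.ip_network(row["network"]), row["geoname_id"])
            for row in csv.DictReader(fileobj)]


def country_for(ip, blocks, locations):
    """Return the country name for an IP, or None if no block contains it.

    This is a linear scan, which is fine for a sketch but too slow for
    the full database; an indexed SQL table or a sorted list with
    bisection would be the practical choice.
    """
    addr = ipaddress.ip_address(ip)
    for network, geoname_id in blocks:
        if addr in network:
            return locations.get(geoname_id)
    return None


locations = load_locations(io.StringIO(LOCATIONS_CSV))
blocks = load_blocks(io.StringIO(BLOCKS_CSV))
print(country_for("192.0.2.55", blocks, locations))
```

With the real files, the `io.StringIO` objects would be replaced by the opened CSV files; everything runs locally, with no API key or network call involved.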
Time Series Data
I'm working on a redesign of the www interface for the crawler[2] and it includes charts. Previously the crawler stored only current snapshots of the data it collected. Now it also generates some aggregate statistics[3], taken at set intervals, and stores them indefinitely. This job is handled separately from the main crawler script, via a set of SQL queries run from a cron job.
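As a rough illustration of the snapshot idea, the sketch below runs one aggregate query and appends its result to a history table. Everything in it is an assumption made for the example: the table names `nodes` and `stats_by_country`, their columns, the sample rows, and the use of SQLite (chosen only so the example is self-contained). The crawler's actual schema and queries are not shown in this post.

```python
import sqlite3

WINDOW_SECONDS = 48 * 3600  # "recently active" window of 48 hours


def take_snapshot(conn, taken_at):
    """Append one row per country for nodes seen within the window."""
    conn.execute(
        """
        INSERT INTO stats_by_country (taken_at, country, node_count)
        SELECT ?, country, COUNT(*)
        FROM nodes
        WHERE last_seen >= ?
        GROUP BY country
        """,
        (taken_at, taken_at - WINDOW_SECONDS),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (ip TEXT PRIMARY KEY, country TEXT, last_seen INTEGER);
    CREATE TABLE stats_by_country (
        taken_at INTEGER, country TEXT, node_count INTEGER
    );
    """
)

now = 1_000_000  # arbitrary Unix-style timestamp for the example
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        ("192.0.2.1", "Exampleland", 990_000),
        ("192.0.2.2", "Exampleland", 950_000),
        ("198.51.100.1", "Samplestan", 999_000),
        ("198.51.100.2", "Samplestan", 100_000),  # stale, outside the window
    ],
)

take_snapshot(conn, taken_at=now)
rows = conn.execute(
    "SELECT country, node_count FROM stats_by_country ORDER BY country"
).fetchall()
print(rows)
```

Because each run only appends rows tagged with `taken_at`, calling it on a schedule (from cron, say) builds up the time series while leaving the live `nodes` table as the current snapshot.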
IRC Bot Interface
With the addition of geolocation and time series data collection, the crawler can finally provide some reports that its underlying library alone cannot. Soon this data will be available via the www interface, but in the meantime I wanted to make it available in the chans via an IRC bot. The bot is based on asciilifeform's logotron bot, with which I've become familiar since using it to power the logger at logs.bitdash.io. It works roughly like this:
billymg | !c help
crawlerbot | billymg: my valid commands are: src, uptime, help, net-summary, version, trb-status
billymg | !c net-summary
crawlerbot | Bitcoin Network (IPv4 Nodes Active Within the Last 48 hours) Global: 8166; TRB-Compatible: 61; TRB: 13
crawlerbot | TRB-Compatible by Country: United States: 26; Canada: 4; Singapore: 4; Romania: 4; Russia: 3; France: 2; United Kingdom: 2; Italy: 1; Lithuania: 1; Norway: 1; Australia: 1; Germany: 1; Chile: 1; Belgium: 1; Spain: 1; Ukraine: 1; Netherlands: 1; Finland: 1; Sweden: 1; Switzerland: 1; Bulgaria: 1; Mexico: 1; South Africa: 1;
crawlerbot | TRB by Country: United States: 7; Canada: 1; Romania: 1; Singapore: 1; Lithuania: 1; France: 1; Norway: 1;
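The per-country lines in the transcript above could be produced by something like the following Python sketch. It is a guess at the formatting step only, not the bot's code. The function name `format_breakdown` is hypothetical, and ordering by descending count with ties left in input order is an assumption inferred from the sample output.

```python
def format_breakdown(label, counts):
    """Render a {country: count} mapping as one IRC-friendly line.

    Entries are ordered by descending count; Python's sort is stable,
    so countries with equal counts keep their input order.
    """
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    body = " ".join(f"{country}: {n};" for country, n in ordered)
    return f"{label}: {body}"


# Counts copied from the "TRB by Country" line in the transcript.
trb_counts = {
    "United States": 7,
    "Canada": 1,
    "Romania": 1,
    "Singapore": 1,
    "Lithuania": 1,
    "France": 1,
    "Norway": 1,
}

line = format_breakdown("TRB by Country", trb_counts)
print(line)
```

In a real bot the `counts` mapping would come from a query against the crawler's database, not from a literal.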
Patches and Signatures
Unfortunately, I had to regrind the genesis patch and the small bug-fix patch that followed, because there was a typo in the root directory's name. I'll leave the original two patches up for archival purposes, but this patch is built on a new tree, with the genesis regrind also including the small fix from the original second patch.
[1] Presumably a subset of the data included in their full commercial offering, but plenty good enough for my needs.
[2] Which at the moment is nothing more than a few Sketch mockups.
[3] Network breakdown by major user agent version, network breakdown by country, TRB breakdown by country, and TRB-compatible breakdown by country. All stats are collected both for recently active nodes (48 hours) and for recently active nodes returning at least one non-self peer (referred to as "participating").