Patch Fixes for the Logotron and Bitdash Crawler

September 16th, 2021

Two quick patches to fix two small bugs: one for the logotron and one for the bitdash crawler. The logotron patch fixes a CSS bug in the classic theme where multiple selected loglines would have their highlight rendered incorrectly. The crawler patch fixes the name of the unique index on the host field and drops the explicit creation of this index in the schema, since Postgres creates it automatically for the unique-constrained field.

Patches and Signatures

Crawler

bitdash_crawler_fix_idx_name.vpatch
bitdash_crawler_fix_idx_name.vpatch.billymg.sig

Logotron

fix_multiln_hlite.kv.vpatch
fix_multiln_hlite.kv.vpatch.billymg.sig

The complete V-Trees for the logotron and crawler have also been updated to include these patches.

Bitdash Crawler: A Watchglass-Based Bitcoin Network Crawler

August 10th, 2021

Introducing the Bitdash Crawler, a simple Bitcoin network crawler that leverages Watchglass to interface with the Bitcoin network protocol. I decided to write this tool because other Bitcoin network monitoring tools no longer publish results for TRB, which used to be the only reason I bothered checking those sites in the first place. What I am publishing here is the source for the crawler, which continuously scans the Bitcoin network by recursively sending version and getaddr messages. The results are then stored in a Postgres database. You can browse the results at bitdash.io/nodes or you can download and run this program yourself to create your own copy of the network map.

Crawler Design

A main thread starts up three worker threads and a "heartbeat" thread that logs program state at a set interval (useful for monitoring/debugging). The three workers are described below, followed by a simplified sketch of the skeleton.

fill_probe_queue
  • Responsible for refilling the probe_queue whenever it is completely emptied.
  • Fills the queue from the list of known nodes in the DB. If the DB is empty (e.g. on the first run or if manually cleared), it will read from the list of seed nodes in nodes.txt.
probe_nodes
  • Takes jobs (nodes to be probed) from the probe_queue and spins up a probe_node thread for each one (up to the limit set in the config via max_sockets).
  • The probe_node child thread attempts to open a socket connection with the host and send a version message. If successful it then sends a getaddr message to ask for a list of connected peers.
  • When the requests succeed, fail, or time out, the results are added to the result_queue. Spam nodes are not added to the result_queue.
insert_results
  • Takes jobs (results to be inserted into the DB) from the result_queue and calls the method for inserting them into the database.
  • Also processes any peers included in the result and adds those that do not exist in the DB to the probe_queue.
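
Here is a heavily simplified sketch of that skeleton. It is not the shipped source: probe_queue, result_queue, and max_sockets come from the design above, while helpers such as known_nodes_from_db, seed_nodes_from_file, probe_node, known, and insert_into_db are hypothetical stand-ins.

    # Simplified structural sketch (Python 2.7), NOT the actual source.
    # Helper functions are hypothetical stand-ins; the heartbeat thread
    # is omitted for brevity.
    import threading
    import Queue  # 'queue' in Python 3

    probe_queue = Queue.Queue()
    result_queue = Queue.Queue()
    max_sockets = 800  # from the config; use a much lower value on a first run

    def fill_probe_queue():
        while True:
            if probe_queue.empty():
                for host in (known_nodes_from_db() or seed_nodes_from_file()):
                    probe_queue.put(host)

    def probe_nodes():
        sem = threading.Semaphore(max_sockets)  # caps concurrent sockets
        while True:
            host = probe_queue.get()
            sem.acquire()  # blocks once max_sockets probes are in flight
            # probe_node (stand-in) opens the socket, does version/getaddr,
            # pushes its result onto result_queue, and releases sem
            threading.Thread(target=probe_node, args=(host, sem)).start()

    def insert_results():
        while True:
            result = result_queue.get()
            insert_into_db(result)
            for peer in result.get('peers', []):  # peers from getaddr replies
                if not known(peer):
                    probe_queue.put(peer)

    for worker in (fill_probe_queue, probe_nodes, insert_results):
        threading.Thread(target=worker).start()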

That's it. No kubernetes cloud vps botnet swarms or whatever the cool kids are doing these days—and it completely ignores IPv6 and "Tor" address spaces. It's just a single Python 2.7 script that'll run fine on a low-power ARM processor with 2GB of RAM. On my modest Rockchip with max_sockets set to 800 it completes a full scan (including ~half a million spam peers—I have not implemented blacklisting yet) in around 90 minutes (between 80 and 130, depending on the number of spam peers unearthed in a pass). And this is the same server that also hosts the www as well as the Bitdash logs mirror.

A Note About the 'max_sockets' Throttle

I currently have mine set to 800; this is with ~19k already-discovered nodes (any node that has responded with a version message at least once) in the database. It should probably be set to a lower number on the first run (maybe start with 50), and you should monitor to make sure you aren't DDoS'ing the nodes in the nodes.txt seed list. Recall that a "pass" won't finish until all of the initial nodes and recursively added peers have been scanned once, so if the seed nodes provide good peer lists you may end up with a good chunk of the network after the first pass. If you're getting lots of peers and it feels like it'll take forever to complete the first pass, you can kill the crawler, up max_sockets, and restart the script.
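
In config terms, the ramp-up might look something like this (a hypothetical illustration; max_sockets is the actual setting name, but consult the config that ships with the genesis for the real format):

    max_sockets = 50     # first run: be gentle on the nodes.txt seeds
    # max_sockets = 800  # later, once the DB holds a healthy node list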

Watchglass as a Library

As mentioned, I did not write the pieces that handle the actual Bitcoin protocol communication; for that I used asciilifeform's Watchglass. However, Watchglass in its current form includes an IRC bot and is set up to be run as its own script. For my purposes I wanted only the Bitcoin protocol portion without the IRC bot. Asciilifeform pointed me to wedger.py, which is pretty close, but it also contains a bombard_node method and is set up to be configured via constants in the file.

What I ended up doing is including a "watchglass.py" in the crawler genesis that consists strictly of the functions and helper methods required for interfacing with the Bitcoin network protocol and is designed to be imported as a library in other Python scripts (i.e. it does not instantiate its own logger and it does not read from a config or set of configurable constants). Where necessary I updated function signatures to allow for passing in parameters that previously would have been set in a config. I also added one method to watchglass.py that I feel is generic enough to be included in the library, unpack_ver_msg, which takes a raw payload and returns the discrete values as a dictionary.
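
To give a feel for the intended library usage, here is a hypothetical sketch. Only unpack_ver_msg is a real name from the patch; the message-building and message-reading calls are placeholders for whatever the library actually exposes.

    # Hypothetical usage sketch. Only unpack_ver_msg is a real name from
    # the patch; make_ver_msg and read_msg are placeholders, and
    # 203.0.113.5 is a documentation address (RFC 5737).
    import socket
    import watchglass

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(10)
    s.connect(('203.0.113.5', 8333))
    s.sendall(watchglass.make_ver_msg())      # placeholder call
    payload = watchglass.read_msg(s)          # placeholder call
    ver = watchglass.unpack_ver_msg(payload)  # real: raw payload -> dict
    print ver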

Including this crawler, there are now three applications that leverage the Watchglass protocol methods. My thinking is that perhaps this watchglass.py can be moved to its own V-Tree, and the Watchglass IRC bot, the wedger, and this crawler can then be updated to rely on the Watchglass library.

Roadmap: Crawler

  • Implement an exponential backoff for querying nodes that 1) fail to respond and 2) have never responded. This should significantly speed up batch time and reduce the number of max_sockets needed to hit a given target processing interval (a sketch of the idea follows this list).
  • Implement a geoIP-lookup feature even though this will require the use of a 3rd party service.
  • Implement storage of long term historical snapshots somewhere in the database. Currently it only stores the last 25 results for each known host (configurable) but if one wanted to display, for example, a graph of TRB nodes over time they would not be able to get the data they need from the current schema.
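
A minimal sketch of the backoff idea, with hypothetical fields (the real schema may track failures differently):

    # Hypothetical sketch of the planned exponential backoff; 'fails' and
    # 'last_probed' are assumed per-node fields, not the actual schema.
    from datetime import datetime, timedelta

    BASE_DELAY = timedelta(minutes=90)  # roughly one full pass
    MAX_DELAY = timedelta(days=7)

    def due_for_probe(node, now=None):
        # Skip a node until its backoff window has elapsed; cap the
        # exponent so the delay computation cannot overflow.
        now = now or datetime.now()
        delay = min(BASE_DELAY * (2 ** min(node['fails'], 10)), MAX_DELAY)
        return now - node['last_probed'] >= delay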

Roadmap: Website

  • Responsive CSS so that it works well on different screen sizes.
  • An improved homepage with a collection of key overview metrics.
  • Drill-down filters for status, version, user agent, and species.

Patch and Signature

bitdash_crawler_genesis.vpatch
bitdash_crawler_genesis.vpatch.billymg.sig

HTML/CSS Improvements and a New Dark Theme for Asciilifeform's Logotron

May 25th, 2021

A few weeks ago I set out to finally do something with a domain I'd been holding for the last few years, bitdash.io. As someone who both designs and codes user interfaces, I wanted[1] to have a place for displaying useful metrics related to the bitcoin network. However, since I had also been reading the logs since 2015, I knew enough to know that you don't just manalone and create something because you "just wanted to". So, when I joined the republic formerly known as TMSR in 2018, I joined as an apprentice. I saw mp-wp[2] as an opportunity to learn while working. Consisting of a www app ultimately rendered in plain old HTML and CSS, it presented itself as something I could contribute to given my existing skillset, but also something I could learn from (the PHP/MySQL parts were new to me, not to mention V).

Code was trimmed[3], a feature was added, patches were produced, and in 2020 TMSR was closed. I had also at this time just moved to Costa Rica, and coincidentally, the world had decided to descend into full-blown authoritarianism—everywhere. And so I took a break from these things to study Spanish, take care of some much-needed renovations at the ranch, and generally just enjoy the new country IRL. But one can only stay idle[4] for so long, and eventually the desire to get back in front of a terminal and do something worthwhile exceeded my desire to relax.

So, back to bitdash.io. What data could I publish that would actually be useful to someone, and that wasn't already published on 1001 other bit-this, blockchain-that websites? Well, how about the one thing none of these other sites publish: anything related to TRB. I started with the simplest metric that I myself used to like to check now and then, the number of active TRB nodes on the network. The two sites that used to display this metric, however, no longer do. So I wrote a simple crawler[5] and ended up with some interesting results. These findings gave me enough encouragement to continue with their publication, so that perhaps others may also see what I saw. It also gave me a clearer picture of what I hope to accomplish overall, and this made visible a deficit in my current stack: a place both for collaborating with others towards this goal and for sorting newcomers. And so, there is now #billymg on IRC, and logs.bitdash.io for those who wish to participate or just follow along. This article, with the exception of the preceding introduction, is about my small contribution to the code that renders those logs.

The First Patch

While working on the crawler I had already started to design a theme for the overall bitdash.io site, and I figured the logs hosted on the same domain should have a similar visual aesthetic. So I configured asciilifeform's logotron and took a look at the HTML and CSS it was generating. There were a lot of things I didn't like[6], so I first proceeded to rewrite/reorganize before writing my new theme. In doing so I also ended up refactoring a Python function that spit out an HTML string directly rather than simply passing data to an HTML template[7]. The first patch linked in this article contains these changes as well as a few minor functionality changes/additions. For those skimming, the changes in the first patch are as follows:

  1. A fairly comprehensive HTML/CSS refactor, as described above.
  2. Search queries shorter than Min_Query_Length (3 by default) characters redirect to the homepage rather than resulting in a 500 error.
  3. The ability to customize the logger's root path via app_root in the config[8].
  4. The ability to point to a different CSS file via css_file in the config.
  5. A bunch more "bots" added to the default bots field in the config.
  6. Some README updates with setup clarifications and a new reverse-chronological 'release_notes.txt' file.

Here is that gen_chanlist function I mentioned:

def gen_chanlist(selected_chan, show_all_chans=False):
    # Get current time
    now = datetime.now()
    # Data for channel display :
    chan_list = []
    chan_idx = 0
    for chan in Channels:
        last_time = query_db(
            '''select t, idx from loglines where chan=%s
               and idx = (select max(idx) from loglines where chan=%s) ;''',
            [chan, chan], one=True)

        last_time_txt = ""
        last_time_url = ""
        if last_time != None:
            span = (now - last_time['t'])
            days = span.days

            # Only add to the list if it should be visible, otherwise continue
            if days > Days_Hide and chan != selected_chan and not show_all_chans:
                continue

            hours = span.seconds/3600
            minutes = (span.seconds%3600)/60

            if days != 0:
                last_time_txt += '%dd ' % days
            if hours != 0:
                last_time_txt += '%dh ' % hours
            if minutes != 0:
                last_time_txt += '%dm' % minutes

            last_time_url = "{0}{1}{2}/{3}#{4}".format(
                get_base(),
                App_Root,
                chan,
                last_time['t'].strftime(Date_Short_Format),
                last_time['idx'])

        chan_list.append({ 'name': chan })

        chan_list[chan_idx]['last_time_url'] = last_time_url
        chan_list[chan_idx]['last_time_txt'] = last_time_txt
        chan_list[chan_idx]['chan_url'] = "{0}{1}{2}{3}".format(
            get_base(), App_Root, chan, '/' if chan == Default_Chan else '')

        chan_idx += 1

    return chan_list

And in the HTML[9] template:

<table class='chan-list' align="center">
  <thead>
    <tr>
      {% for chan_item in chan_list %}
        <th>
          <a class='chan-link {% if chan_item.name == chan %}chan-link-active{% endif %}'
             href='{{ chan_item.chan_url }}'><b>{{ chan_item.name }}</b></a>
        </th>
      {% endfor %}
    </tr>
  </thead>
  <tbody>
    <tr>
      {% for chan_item in chan_list %}
        <td>
          <a class='chan-last-active-link' href='{{ chan_item.last_time_url }}'>{{ chan_item.last_time_txt }}</a>
        </td>
      {% endfor %}
    </tr>
  </tbody>
</table>

Compared to before:

def gen_chanlist(selected_chan, show_all_chans=False):
    # Get current time
    now = datetime.now()
    # Data for channel display :
    chan_tbl = {}
    for chan in Channels:
        chan_tbl[chan] = {}
        chan_tbl[chan]['show'] = False

        chan_formed = chan
        if chan == selected_chan:
            chan_formed = "<span class='highlight'>" + chan + "</span>"

        chan_tbl[chan]['link'] = """<a href="{0}log/{1}"><b>{2}</b></a>""".format(
            get_base(), chan, chan_formed)

        last_time = query_db(
            '''select t, idx from loglines where chan=%s
               and idx = (select max(idx) from loglines where chan=%s) ;''',
            [chan, chan], one=True)

        last_time_txt = ""
        time_field = ""
        if last_time != None:
            span = (now - last_time['t'])
            days = span.days
            hours = span.seconds/3600
            minutes = (span.seconds%3600)/60

            if days != 0:
                last_time_txt += '%dd ' % days
            if hours != 0:
                last_time_txt += '%dh ' % hours
            if minutes != 0:
                last_time_txt += '%dm' % minutes

            time_field = """<i><a href="{0}log/{1}/{2}#{3}">{4}</a></i>""".format(
                get_base(),
                chan,
                last_time['t'].strftime(Date_Short_Format),
                last_time['idx'],
                last_time_txt)

            if (days <= Days_Hide) or (chan == selected_chan) or show_all_chans:
                chan_tbl[chan]['show'] = True

        chan_tbl[chan]['time'] = time_field

    ## Generate channel selector bar :
    s = """<table align="center" class="chantable"><tr>"""
    for chan in Channels:
        if chan_tbl[chan]['show']:
            s += """<th>{0}</th>""".format(chan_tbl[chan]['link'])
    s += "</tr><tr>"
    ## Generate last-activ. links for above :
    for chan in Channels:
        if chan_tbl[chan]['show']:
            s += """<td>{0}</td>""".format(chan_tbl[chan]['time'])
    # wrap up:
    s += "</tr></table>"
    return s

I personally find it much easier to read now. Also, when one is following the README, which before stated, "Adjust the three 'flask' templates in 'templates' subdir to give the desired look and feel for the www end", one can now change the look of the chan list as well, without having to modify any Python.

Now, about that query. I did not fix it in either of these patches, but I did do some quick testing on my RK, if anyone is curious about the results:

Current:

SELECT t, idx FROM loglines WHERE chan='asciilifeform' AND idx = (SELECT max(idx) FROM loglines WHERE chan='asciilifeform');

             t             |   idx
---------------------------+---------
 2021-05-25 18:51:44.88612 | 1037761
(1 row)

Time: 158.185 ms

SELECT t, idx FROM loglines WHERE chan='trilema' AND idx = (SELECT max(idx) FROM loglines WHERE chan='trilema');

             t              |   idx
----------------------------+---------
 2020-03-13 08:47:33.022321 | 1959633
(1 row)

Time: 3.977 ms

SELECT t, idx FROM loglines WHERE chan='ossasepia' AND idx = (SELECT max(idx) FROM loglines WHERE chan='ossasepia');

             t              |   idx
----------------------------+---------
 2020-09-22 03:57:31.860485 | 1028603
(1 row)

Time: 163.218 ms

Total Time: 325.38 ms

As a UNION:

SELECT chan, t, idx FROM loglines WHERE chan='asciilifeform' AND idx = (SELECT max(idx) FROM loglines WHERE chan='asciilifeform') UNION
SELECT chan, t, idx FROM loglines WHERE chan='trilema' AND idx = (SELECT max(idx) FROM loglines WHERE chan='trilema') UNION
SELECT chan, t, idx FROM loglines WHERE chan='ossasepia' AND idx = (SELECT max(idx) FROM loglines WHERE chan='ossasepia');

     chan      |             t              |   idx
---------------+----------------------------+---------
 asciilifeform | 2021-05-25 18:51:44.88612  | 1037761
 ossasepia     | 2020-09-22 03:57:31.860485 | 1028603
 trilema       | 2020-03-13 08:47:33.022321 | 1959633
(3 rows)

Time: 292.233 ms

As a UNION with knowledge of the last time queried:

SELECT chan, t, idx FROM loglines WHERE chan='asciilifeform' AND idx = (SELECT max(idx) FROM loglines WHERE chan='asciilifeform' AND t > '2021-05-24') UNION
SELECT chan, t, idx FROM loglines WHERE chan='trilema' AND idx = (SELECT max(idx) FROM loglines WHERE chan='trilema' AND t > '2020-03-12') UNION
SELECT chan, t, idx FROM loglines WHERE chan='ossasepia' AND idx = (SELECT max(idx) FROM loglines WHERE chan='ossasepia' AND t > '2020-09-22');

     chan      |             t              |   idx
---------------+----------------------------+---------
 asciilifeform | 2021-05-25 18:51:44.88612  | 1037761
 ossasepia     | 2020-09-22 03:57:31.860485 | 1028603
 trilema       | 2020-03-13 08:47:33.022321 | 1959633
(3 rows)

Time: 167.390 ms

Overall quite good, and likely even better in production with a more fine-grained time filter. I'll likely include this improvement in my next patch for the logotron if no one else beats me to it.
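
For when I do, a rough sketch of how the filtered UNION could be built from timestamps already in hand (hypothetical; the loglines table and %s-style parameters are as in gen_chanlist above, while last_seen is an assumed dict of each channel's previously observed last 't'):

    # Hypothetical sketch: build the UNION query with per-channel time
    # filters. 'last_seen' should map each channel to a timestamp slightly
    # BEFORE its known last line (filtering at exactly the last 't' would
    # exclude the line itself).
    def build_chanlist_query(channels, last_seen):
        parts = []
        params = []
        for chan in channels:
            parts.append(
                "SELECT chan, t, idx FROM loglines WHERE chan=%s "
                "AND idx = (SELECT max(idx) FROM loglines "
                "WHERE chan=%s AND t > %s)")
            params.extend([chan, chan, last_seen[chan]])
        return " UNION ".join(parts), params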

The Second Patch

The second patch is just a CSS file and a small snippet of HTML for the chan list. I simply could not reconcile the two chan list formats to achieve what I wanted with my theme and also leave alf's theme intact even in terminal-based browsers. The problem wasn't even the table; that was easy to contort into a list with td { display: list-item }. What ultimately made it unworkable was the fact that the original table used two separate table rows for the chan names and the last active times, meaning the order of elements in the DOM was not chan1, chan1_ts, chan2, chan2_ts... but rather chan1, chan2...chan1_ts, chan2_ts... (you can witness this yourself by loading logs.nosuchlabs.com/log and tabbing through the channels at the top). At this point I gave up and decided to just include 'chan-nav-list.html' in the templates folder. By default it is unused, but one can point to it by changing one line in 'templates/layout.html', which is required if using the theme at 'static/bitdash.css'.

For those who would like to try these changes out: you will need these two patches and signatures, as well as the entire logotron tree from before (which is now mirrored here on this site).

frontend_updates.kv.vpatch
frontend_updates.kv.vpatch.billymg.sig

add_bitdash_theme.kv.vpatch
add_bitdash_theme.kv.vpatch.billymg.sig

There is also another logotron out there that I would like to try for myself, and I plan to do so when I return to mp-wp work after first publishing the crawler.

  1. Why? I don't know, it just seemed like a "cool idea" (talk about towards purpose over from causes).
  2. The blogging platform of choice for many republican members, based on a trimmed-down fork of WordPress 2.7.
  3. A bit of an understatement, nearly 50% was hacked off.
  4. This "idle" of mine was still probably twice as demanding as what most chair warmers in various HR departments consider "busy".
  5. Using asciilifeform's Watchglass as my BTC protocol library.
  6. Hard to read, HTML mixed in with Python, a mix of HTML and CSS styling, "!important" hacks for no reason, redundant or dead HTML/CSS.
  7. This should generally only be done as a last resort or in very small/targeted doses, as it defeats the purpose of code/markup separation and hurts overall readability.
  8. Note that 'app_root' is somewhat of a misnomer since search, for example, appears at '$base_url/log-search' rather than '$base_url/$app_root/search'. This is something I would like to change, but it would result in a lot of broken links in the existing log data that would have to be bulk updated, so I left it alone for now.
  9. Jinja.