I've been using asciilifeform's logotron to power my IRC logger for a while now. The setup consists of an IRC bot and a Flask-based web app, both connected to a Postgres database. I run it on a simple Rockchip machine alongside my Bitcoin network crawler, which also uses Postgres. For some reason, every couple of weeks [1] the bot would lose its DB connection and have to be manually restarted [2] before it would resume eating log lines. Same with the crawler. There may be a performance issue somewhere that causes the connection to time out and close under certain conditions. I know it's not the logotron itself, since asciilifeform ran the same code on the same box for almost a year and never hit the DB connection issue, but perhaps both apps sharing the same small RK is just too much for it to handle at times.
In any case, it was a problem for me as well as for others, and I figured it couldn't hurt for the bot to have auto-DB-reconnect functionality, so I came up with a small patch to remedy the issue. Below is the tidied-up version of what I currently have running.

DB_Name = cfg.get("db", "db_name")
DB_User = cfg.get("db", "db_user")
DB_DEBUG = cfg.get("db", "db_debug")
+DB_Reconn_Tries = int(cfg.get("db", "db_reconnect_max_tries"))
+DB_Reconn_Delay = int(cfg.get("db", "db_reconnect_delay"))
# Logism:
Base_URL = cfg.get("logotron", "base_url")
App_Root = cfg.get("logotron", "app_root")


##############################################################################

# Connect to the given DB
-try:
-    db = psycopg2.connect("dbname=%s user=%s" % (DB_Name, DB_User))
-except Exception:
-    print "Could not connect to DB!"
-    logging.error("Could not connect to DB!")
-    exit(1)
-else:
-    logging.info("Connected to DB!")
+db = None
+
+def conn_db():
+    global db
+
+    tries = DB_Reconn_Tries
+
+    while True:
+        # Connect to the given DB
+        try:
+            db = psycopg2.connect("dbname=%s user=%s" % (DB_Name, DB_User))
+        except Exception:
+            print "Could not connect to DB!"
+            logging.error("Could not connect to DB!")
+            # Retry while tries remain; DB_Reconn_Tries of -1 means retry forever
+            if tries > 0 or DB_Reconn_Tries == -1:
+                tries = tries - 1
+                time.sleep(DB_Reconn_Delay)
+                continue
+            else:
+                exit(1)
+        else:
+            logging.info("Connected to DB!")
+            break
+
+conn_db()

##############################################################################

def close_db():
    db.close()

+def ensure_db_is_alive():
+    # Ping the db to ensure it's alive and connected
+    logging.debug("Checking DB connection status...")
+    try:
+        cur = db.cursor()
+        cur.execute('SELECT 1')
+    except (psycopg2.OperationalError, psycopg2.InterfaceError):
+        # A failed ping flips the connection's closed flag to nonzero
+        pass
+
+    # If connection is alive db.closed will equal 0
+    if db.closed == 0:
+        return True
+
+    # Otherwise, attempt to reconnect
+    logging.debug("No DB Connection!")
+    conn_db()
+
def exec_db(query, args=()):
    cur = db.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    if (DB_DEBUG): logging.debug("query: '{0}'".format(query))

def save_line(time, chan, speaker, action, payload):
    ## Put in DB:
    try:
+        ensure_db_is_alive()
+
        # Get index of THIS new line to be saved
        last_idx = query_db(
            '''select idx from loglines where chan=%s
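The two new knobs belong in the existing [db] section of the bot's config file. A minimal sketch of what that stanza might look like (the values here are illustrative, not from my actual config; per the code above, -1 tries means retry forever, and the delay is in seconds):

[db]
db_name = logdb
db_user = logbot
db_debug = 0
db_reconnect_max_tries = -1
db_reconnect_delay = 10

An unlimited retry count suits an unattended logger: the bot simply waits out the outage instead of dying and waiting for a human.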
With this running locally I was able to bring down the DB, send a few IRC messages, restart the DB, and watch the messages sent while the DB was offline get recorded once it came back up. I'm going to give it a few weeks in production to see how it fares. If it does indeed solve the issue (and doesn't cause any others), I'll package it as a vpatch and publish it here along with the rest of the logotron tree.
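If you'd like to reproduce that test outside the bot, here's a minimal standalone sketch of the same ping-then-reconnect pattern. It assumes a throwaway local DB named testdb and the same Python 2 / psycopg2 environment the logotron runs in; stop and start Postgres while it loops and it should recover on its own:

import time
import logging
import psycopg2

logging.basicConfig(level=logging.INFO)

db = None

def conn_db():
    # Same shape as the patched conn_db() above, retrying forever
    global db
    while True:
        try:
            db = psycopg2.connect("dbname=testdb")
        except Exception:
            logging.error("Could not connect to DB!")
            time.sleep(5)
        else:
            logging.info("Connected to DB!")
            break

def ensure_db_is_alive():
    # Ping; a dead connection raises and flips db.closed to nonzero
    try:
        db.cursor().execute('SELECT 1')
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        pass
    if db.closed != 0:
        conn_db()

conn_db()
while True:
    ensure_db_is_alive()
    cur = db.cursor()
    cur.execute('SELECT now()')
    print cur.fetchone()[0]  # resumes printing once the DB comes back
    db.commit()
    time.sleep(2)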
1. Sometimes longer, but also sometimes much shorter. I suspect it has to do with the usage/load of the sites.
2. This happened often enough that I even wrote a small script to pull from asciilifeform's channel logs when mine fell behind:
#!/bin/bash

batch_size=500
start_line=$1
end_line=$2
chan_url=$3
results_file=$4

if [ $1 -ge $2 ]; then
    echo "end line must be greater than the start line"
    exit 1
fi

num_lines=$(( $end_line - $start_line ))
num_batches=$(( $num_lines / ($batch_size + 1) + 1 ))
last_batch_size=$(( $num_lines % ($batch_size + 1) ))

echo "starting at ${start_line}, ending at ${end_line}, for a total of ${num_lines} lines"
echo "will download from ${chan_url}?istart=${start_line}&iend=${end_line} in ${num_batches} batch(es). last batch will be ${last_batch_size}"

batch=1
while [ $start_line -lt $end_line ]; do
    if [ $batch -eq $num_batches ]; then
        cur_end=$(( $start_line + $last_batch_size ))
    else
        cur_end=$(( $start_line + $batch_size ))
    fi
    echo "pulling ${start_line} - ${cur_end}"
    curl -s "${chan_url}?istart=${start_line}&iend=${cur_end}" >> $results_file
    start_line=$(( $cur_end + 1 ))
    batch=$(( $batch + 1 ))
done
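A hypothetical invocation, with a placeholder script name and URL (point it at whichever raw-log export endpoint your logotron instance serves with istart/iend parameters):

./pull_lines.sh 1000 9000 "http://logs.example.com/log-raw/mychan" recovered.txt

This would fetch lines 1000 through 9000 in roughly 500-line batches and append them to recovered.txt.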