Deploying a Citus High-Availability Environment Based on Patroni
1. Introduction
Citus is a very useful extension that gives PostgreSQL the ability to scale horizontally; put another way, it is a distributed HTAP database built on PostgreSQL and deployed as a PostgreSQL extension. This article briefly describes the HA options for Citus and then demonstrates, step by step, how to build a Citus HA environment with Patroni.
2. Technical approach
2.1 Choosing a Citus HA scheme
A Citus cluster consists of one coordinator (CN) node and N worker nodes. The CN can be protected by any generic PostgreSQL HA scheme, that is, two PostgreSQL machines configured as primary and standby via streaming replication. The workers can use the same native PostgreSQL scheme, but they also support an alternative HA scheme based on multi-replica shards.
Multi-replica sharding was the default worker HA scheme in early Citus versions (at the time the default value of citus.shard_replication_factor was 2). It is very simple to deploy, and losing one worker node does not interrupt the business. However, with this scheme the CN must write every piece of data to two workers separately, which brings a series of drawbacks:
- Lower write performance
- Consistency guarantees across the replicas are weaker than native PostgreSQL streaming replication
- Functional limitations, for example the Citus MX architecture is not supported
The multi-replica scheme therefore fits a limited set of scenarios. The Citus documentation suggests it may only be suitable for append-only workloads, and it is no longer the recommended HA scheme (in Citus 6.1 the default value of citus.shard_replication_factor was changed from 2 to 1).
Accordingly, the recommendation is to deploy HA for both the CN and the workers with native PostgreSQL streaming replication.
2.2 Choosing a PostgreSQL HA tool
Deploying and maintaining streaming-replication HA with PostgreSQL alone is not terribly complex, but if we want a higher degree of automation, in particular automatic failover, a third-party HA tool can be used. Several open-source options exist; the following are among the most commonly used:
- PAF (PostgreSQL Automatic Failover)
- repmgr
- Patroni
A comparison of these tools is available at https://scalegrid.io/blog/managing-high-availability-in-postgresql-part-1/
Patroni stores its metadata in a DCS (Distributed Configuration Store, e.g. etcd, ZooKeeper, or Consul), which strictly guarantees metadata consistency and makes it highly reliable; it is also quite feature-rich.
I therefore personally recommend Patroni (when only two machines are available and etcd cannot be deployed, consider one of the alternatives). This article covers a Patroni-based deployment.
2.3 Client traffic switching
After a PostgreSQL switchover, database clients must reconnect to the new primary. The common options are listed below.
- HAProxy
  - Pros
    - Reliable
    - Supports load balancing
  - Cons
    - Performance overhead
    - HAProxy itself needs its own HA setup
- VIP
  - Pros
    - No performance overhead, consumes no extra machine resources
  - Cons
    - Primary and standby must be on the same subnet
- Client-side multi-host URL (see the sketch after this list)
  - Pros
    - No performance overhead, consumes no extra machine resources
    - No dependence on a VIP, easy to deploy in cloud environments
    - pgjdbc supports read/write splitting and load balancing
  - Cons
    - Only some client drivers support it (currently pgjdbc, libpq, and libpq-based drivers such as those for Python and PHP)
    - If the database layer fails to prevent a "dual primary", there is a relatively high risk of clients writing to both primaries at once
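To make the multi-host option concrete, here is a minimal sketch using this article's CN addresses (the parameter names are the standard libpq and pgjdbc ones):

# libpq-based clients (psql, Python, PHP, ...): list both CN nodes;
# target_session_attrs=read-write makes the driver settle on the current primary.
psql "postgresql://192.168.234.201:5432,192.168.234.202:5432/postgres?target_session_attrs=read-write"

# The pgjdbc equivalent for Java applications (targetServerType=master on older
# pgjdbc releases; newer releases also accept the value primary):
#   jdbc:postgresql://192.168.234.201:5432,192.168.234.202:5432/postgres?targetServerType=master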
Given the characteristics of a Citus cluster, the recommended candidates are:
- Application connecting to the Citus CN
  - Client-side multi-host URL: recommended when the client driver supports it, especially for Java applications
  - VIP
- Citus CN connecting to the workers
  - VIP
  - Dynamically rewriting the worker metadata on the CN when a worker fails over
For the CN-to-worker connection, the experiments below demonstrate two architectures that use different mechanisms.
Basic architecture
- The CN connects to the worker primary through the worker's real IP
- A monitoring script on the CN watches the workers' state and rewrites the Citus metadata on the CN when a worker switchover happens
Read/write-splitting architecture
- The CN connects to the workers through a read-write VIP and a read-only VIP
- A Patroni callback script on the CN makes the CN primary use the read-write VIPs and the CN standby use the read-only VIPs
- A Patroni callback script on each worker binds the read-write VIP dynamically
- keepalived on each worker binds the read-only VIP dynamically
3. Lab environment
Main software
- CentOS 7.8
- PostgreSQL 12
- Citus 9.4
- Patroni 1.6.5
- etcd 3.3.25
Machines and VIP resources
- Citus CN
  - node1: 192.168.234.201
  - node2: 192.168.234.202
- Citus Worker
  - node3: 192.168.234.203
  - node4: 192.168.234.204
- etcd
  - node4: 192.168.234.204
- VIP (for the Citus worker wk1)
  - read-write VIP: 192.168.234.210
  - read-only VIP: 192.168.234.211
Environment preparation
Synchronize the clocks on all nodes:
yum install -y ntpdate
ntpdate time.windows.com && hwclock -w
If a firewall is used, the ports for PostgreSQL, etcd, and Patroni must be opened:
- postgres:5432
- patroni:8008
- etcd:2379/2380
The simpler alternative is to disable SELinux and the firewall altogether:
setenforce 0
sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config
systemctl disable firewalld.service
systemctl stop firewalld.service
iptables -F
4. etcd deployment
Since etcd high availability is not the topic of this article, only a single-node etcd is deployed on node4 for the experiment. A production deployment needs at least three machines, either dedicated ones or shared with the databases. The steps are as follows.
Install the required packages:
yum install -y gcc python-devel epel-release
Install etcd:
yum install -y etcd
Edit the etcd configuration file /etc/etcd/etcd.conf; a reference configuration:
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://192.168.234.204:2380"
ETCD_LISTEN_CLIENT_URLS="http://localhost:2379,http://192.168.234.204:2379"
ETCD_NAME="etcd0"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.234.204:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.234.204:2379"
ETCD_INITIAL_CLUSTER="etcd0=http://192.168.234.204:2380"
ETCD_INITIAL_CLUSTER_TOKEN="cluster1"
ETCD_INITIAL_CLUSTER_STATE="new"
Start etcd:
systemctl start etcd
Enable etcd at boot:
systemctl enable etcd
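Optionally verify that etcd is healthy before continuing; a quick check, assuming the etcd v2 API that Patroni 1.x speaks by default:

etcdctl --endpoints http://192.168.234.204:2379 cluster-health
etcdctl --endpoints http://192.168.234.204:2379 member list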
5. PostgreSQL + Citus + Patroni HA deployment
Install the following software on every machine that will run a PostgreSQL instance.
Install PostgreSQL 12 and Citus:
yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
yum install -y postgresql12-server postgresql12-contrib
yum install -y citus_12
Install Patroni:
yum install -y gcc epel-release
yum install -y python-pip python-psycopg2 python-devel
pip install --upgrade pip
pip install --upgrade setuptools
pip install patroni[etcd]
Create the PostgreSQL data directory:
mkdir -p /pgsql/data
chown postgres:postgres -R /pgsql
chmod -R 700 /pgsql/data
Create the Patroni service file /etc/systemd/system/patroni.service:
[Unit]
Description=Runners to orchestrate a high-availability PostgreSQL
After=syslog.target network.target
[Service]
Type=simple
User=postgres
Group=postgres
#StandardOutput=syslog
ExecStart=/usr/bin/patroni /etc/patroni.yml
ExecReload=/bin/kill -s HUP $MAINPID
KillMode=process
TimeoutSec=30
Restart=no
[Install]
WantedBy=multi-user.target
Create the Patroni configuration file /etc/patroni.yml; below is the configuration for node1:
scope: cn
namespace: /service/
name: pg1
restapi:
listen: 0.0.0.0:8008
connect_address: 192.168.234.201:8008
etcd:
host: 192.168.234.204:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
master_start_timeout: 300
synchronous_mode: false
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
listen_addresses: "0.0.0.0"
port: 5432
wal_level: logical
hot_standby: "on"
wal_keep_segments: 1000
max_wal_senders: 10
max_replication_slots: 10
wal_log_hints: "on"
max_connections: "100"
max_prepared_transactions: "100"
shared_preload_libraries: "citus"
citus.node_conninfo: "sslmode=prefer"
citus.replication_model: streaming
citus.task_assignment_policy: round-robin
initdb:
- encoding: UTF8
- locale: C
- lc-ctype: zh_CN.UTF-8
- data-checksums
pg_hba:
- host replication repl 0.0.0.0/0 md5
- host all all 0.0.0.0/0 md5
postgresql:
listen: 0.0.0.0:5432
connect_address: 192.168.234.201:5432
data_dir: /pgsql/data
bin_dir: /usr/pgsql-12/bin
authentication:
replication:
username: repl
password: "123456"
superuser:
username: postgres
password: "123456"
basebackup:
max-rate: 100M
checkpoint: fast
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
On the other PostgreSQL nodes, the following four parameters in patroni.yml must be adjusted:
- scope
  - node1 and node2: cn
  - node3 and node4: wk1
- name
  - node1 through node4: pg1 through pg4, respectively
- restapi.connect_address
  - set to the node's own IP
- postgresql.connect_address
  - set to the node's own IP
Start Patroni
Start Patroni on all nodes:
systemctl start patroni
Within one cluster, the Patroni instance that starts first runs as the leader and performs the initial creation of the PostgreSQL instance and its users; the other nodes clone their data from the leader when they first start.
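For the curious, the cluster state can be inspected directly in the DCS; a sketch assuming the v2 keyspace that Patroni 1.x uses (the namespace /service/ plus the scope):

etcdctl --endpoints http://192.168.234.204:2379 ls /service/cn
etcdctl --endpoints http://192.168.234.204:2379 get /service/cn/leader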
Check the state of the cn cluster:
[root@node1 ~]# patronictl -c /etc/patroni.yml list
+ Cluster: cn (6869267831456178056) ---------+----+-----------+-----------------+
| Member | Host            | Role   | State   | TL | Lag in MB | Pending restart |
+--------+-----------------+--------+---------+----+-----------+-----------------+
| pg1    | 192.168.234.201 |        | running |  1 |       0.0 |               * |
| pg2    | 192.168.234.202 | Leader | running |  1 |           |                 |
+--------+-----------------+--------+---------+----+-----------+-----------------+
Check the state of the wk1 cluster:
[root@node3 ~]# patronictl -c /etc/patroni.yml list
+ Cluster: wk1 (6869267726994446390) --------+----+-----------+-----------------+
| Member | Host            | Role   | State   | TL | Lag in MB | Pending restart |
+--------+-----------------+--------+---------+----+-----------+-----------------+
| pg3    | 192.168.234.203 |        | running |  1 |       0.0 |               * |
| pg4    | 192.168.234.204 | Leader | running |  1 |           |                 |
+--------+-----------------+--------+---------+----+-----------+-----------------+
For convenient day-to-day operation, set the global environment variable PATRONICTL_CONFIG_FILE:
echo 'export PATRONICTL_CONFIG_FILE=/etc/patroni.yml' >/etc/profile.d/patroni.sh
Add the following environment variables to ~postgres/.bash_profile:
export PGDATA=/pgsql/data
export PATH=/usr/pgsql-12/bin:$PATH
Give postgres sudo privileges:
echo 'postgres ALL=(ALL) NOPASSWD: ALL'> /etc/sudoers.d/postgres
6. Configure Citus
Create the citus extension on the cn and wk1 primary nodes:
create extension citus;
On the cn primary, add the IP of the wk1 primary, with groupid set to 1:
SELECT * from master_add_node('192.168.234.204', 5432, 1, 'primary');
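To confirm what the CN has recorded, query the worker metadata table (the expected result is a single row with groupid 1 and noderole primary):

psql -c "select nodeid, groupid, nodename, nodeport, noderole from pg_dist_node"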
On the worker primary and standby nodes, edit the /pgsql/data/pg_hba.conf configuration file and add the following entries before the existing rules, so that the CN can connect to the workers without a password:
host all all 192.168.234.201/32 trust
host all all 192.168.234.202/32 trust
Reload the configuration after the change:
su - postgres
pg_ctl reload
Note: passwordless access could also be achieved through ~postgres/.pgpass on the CN, but that is less convenient to maintain than the approach above.
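For completeness, a minimal sketch of that alternative, reusing the passwords from patroni.yml above:

# On both CN nodes: one hostname:port:database:username:password line per worker;
# libpq requires the file to be mode 600.
cat >> ~postgres/.pgpass <<EOF
192.168.234.203:5432:*:postgres:123456
192.168.234.204:5432:*:postgres:123456
EOF
chown postgres:postgres ~postgres/.pgpass
chmod 600 ~postgres/.pgpass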
Create a distributed table to verify the setup:
create table tb1(id int primary key,c1 text);
set citus.shard_count = 64;
select create_distributed_table('tb1','id');
select * from tb1;
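As a quick sanity check on the table just created, insert some rows through the CN and read them back; they are spread across the table's shards on the worker:

psql -c "insert into tb1 select id, md5(id::text) from generate_series(1,100) id"
psql -c "select count(*) from tb1"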
7. Configure automatic worker traffic switching
The worker IP configured above was the worker primary's IP at that point in time; after a worker switchover, it must be updated accordingly.
One way to implement this is a script that monitors the workers' primary/standby state and, whenever the roles change, automatically updates the worker metadata in Citus to the new primary's IP. A reference implementation follows.
Add the following to /etc/patroni.yml on the CN primary and standby (this citus section is consumed by the custom script below, not by Patroni itself):
citus:
loop_wait: 10
databases:
- postgres
workers:
- groupid: 1
nodes:
- 192.168.234.203:5432
- 192.168.234.204:5432
A standalone configuration file can be used instead; in that case the authentication settings must be added:
postgresql:
connect_address: 192.168.234.202:5432
authentication:
superuser:
username: postgres
password: "123456"
Create the worker traffic switching script /pgsql/citus_controller.py:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
import time
import argparse
import logging
import yaml
import psycopg2


def get_pg_role(url):
    """Return 'primary'/'secondary' based on pg_is_in_recovery(), or 'unknown' on error."""
    result = 'unknown'
    try:
        with psycopg2.connect(url, connect_timeout=2) as conn:
            conn.autocommit = True
            cur = conn.cursor()
            cur.execute("select pg_is_in_recovery()")
            row = cur.fetchone()
            if row[0] == True:
                result = 'secondary'
            elif row[0] == False:
                result = 'primary'
    except Exception as e:
        logging.debug('get_pg_role() failed. url:{0} error:{1}'.format(
            url, str(e)))
    return result


def update_worker(url, role, groupid, nodename, nodeport):
    """Point the pg_dist_node entry for (groupid, role) at nodename:nodeport if it changed."""
    logging.debug('call update worker. role:{0} groupid:{1} nodename:{2} nodeport:{3}'.format(
        role, groupid, nodename, nodeport))
    try:
        sql = "select nodeid,nodename,nodeport from pg_dist_node where groupid={0} and noderole = '{1}' order by nodeid limit 1".format(
            groupid, role)
        conn = psycopg2.connect(url, connect_timeout=2)
        conn.autocommit = True
        cur = conn.cursor()
        cur.execute(sql)
        row = cur.fetchone()
        if row is None:
            logging.error("can not find nodeid whose groupid={0} noderole = '{1}'".format(groupid, role))
            return False
        nodeid = row[0]
        oldnodename = row[1]
        oldnodeport = str(row[2])
        if oldnodename == nodename and oldnodeport == nodeport:
            logging.debug('skip for current nodename:nodeport is same')
            return False
        sql = "select master_update_node({0}, '{1}', {2})".format(nodeid, nodename, nodeport)
        cur.execute(sql)
        logging.info("Changed worker node {0} from '{1}:{2}' to '{3}:{4}'".format(
            nodeid, oldnodename, oldnodeport, nodename, nodeport))
        return True
    except Exception as e:
        logging.error('update_worker() failed. role:{0} groupid:{1} nodename:{2} nodeport:{3} error:{4}'.format(
            role, groupid, nodename, nodeport, str(e)))
        return False


def main():
    parser = argparse.ArgumentParser(description='Script to auto setup Citus worker')
    parser.add_argument('-c', '--config', default='citus_controller.yml')
    parser.add_argument('-d', '--debug', action='store_true', default=False)
    args = parser.parse_args()

    if args.debug:
        logging.basicConfig(format='%(asctime)s %(levelname)s: %(message)s', level=logging.DEBUG)
    else:
        logging.basicConfig(format='%(asctime)s %(levelname)s: %(message)s', level=logging.INFO)

    # read the config file (the same keys work from /etc/patroni.yml or a standalone file)
    f = open(args.config, 'r')
    contents = f.read()
    config = yaml.load(contents, Loader=yaml.FullLoader)
    cn_connect_address = config['postgresql']['connect_address']
    username = config['postgresql']['authentication']['superuser']['username']
    password = config['postgresql']['authentication']['superuser']['password']
    databases = config['citus']['databases']
    workers = config['citus']['workers']
    loop_wait = config['citus'].get('loop_wait', 10)

    logging.info('start main loop')
    loop_count = 0
    while True:
        loop_count += 1
        logging.debug("##### main loop start [{}] #####".format(loop_count))
        dbname = databases[0]
        cn_url = "postgres://{0}/{1}?user={2}&password={3}".format(
            cn_connect_address, dbname, username, password)
        # only act when the local CN is the primary
        if get_pg_role(cn_url) == 'primary':
            for worker in workers:
                groupid = worker['groupid']
                nodes = worker['nodes']
                # probe the role of every node in this worker group
                primarys = []
                secondarys = []
                for node in nodes:
                    wk_url = "postgres://{0}/{1}?user={2}&password={3}".format(
                        node, dbname, username, password)
                    role = get_pg_role(wk_url)
                    if role == 'primary':
                        primarys.append(node)
                    elif role == 'secondary':
                        secondarys.append(node)
                logging.debug('Role info groupid:{0} primarys:{1} secondarys:{2}'.format(
                    groupid, primarys, secondarys))
                # update the worker metadata in every database
                for dbname in databases:
                    cn_url = "postgres://{0}/{1}?user={2}&password={3}".format(
                        cn_connect_address, dbname, username, password)
                    if len(primarys) == 1:
                        nodename = primarys[0].split(':')[0]
                        nodeport = primarys[0].split(':')[1]
                        update_worker(cn_url, 'primary', groupid, nodename, nodeport)
                    """
                    Citus requires nodename:nodeport to be unique in the pg_dist_node
                    metadata, so secondary entries cannot also be updated dynamically.
                    A possible workaround is to give each worker two IP addresses, one
                    used in the primary role and the other in the secondary role.
                    if len(secondarys) >= 1:
                        nodename = secondarys[0].split(':')[0]
                        nodeport = secondarys[0].split(':')[1]
                        update_worker(cn_url, 'secondary', groupid, nodename, nodeport)
                    elif len(secondarys) == 0 and len(primarys) == 1:
                        nodename = primarys[0].split(':')[0]
                        nodeport = primarys[0].split(':')[1]
                        update_worker(cn_url, 'secondary', groupid, nodename, nodeport)
                    """
        time.sleep(loop_wait)


if __name__ == '__main__':
    main()
Create the service file /etc/systemd/system/citus_controller.service for this script:
[Unit]
Description=Auto update primary worker ip in Citus CN
After=syslog.target network.target
[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=/bin/python /pgsql/citus_controller.py -c /etc/patroni.yml
KillMode=process
TimeoutSec=30
Restart=no
[Install]
WantedBy=multi-user.target
Start the worker traffic switching script on both CN nodes:
systemctl start citus_controller
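One way to exercise the controller, sketched with this lab's layout: switch over the worker cluster, then watch the CN metadata follow within one loop_wait interval.

patronictl -c /etc/patroni.yml switchover wk1
sleep 15
# on the CN primary: nodename should now show the new worker primary's IP
psql -c "select nodename, nodeport from pg_dist_node where groupid=1"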
8. Read/write splitting
With the configuration above, the Citus CN never touches the worker standbys, so those machines sit idle. Can they be put to use, giving the Citus CN read/write splitting? Concretely, the CN standby should prefer the worker standbys and fall back to the worker primaries when a worker standby is down.
Citus natively supports read/write splitting: the primary and standby of one worker can be added to the same worker group as two "workers", one in the primary role and the other in the secondary role. However, because the pg_dist_node metadata requires nodename:nodeport to be unique, the dynamic metadata-rewriting approach used earlier cannot update primary and secondary entries at the same time.
There are two workarounds.
Method 1: store only fixed hostnames, such as wk1, wk2, ..., in the Citus metadata, and have a custom traffic switching script resolve those hostnames to different IPs in /etc/hosts: to the worker primary's IP on the CN primary and to the worker standby's IP on the CN standby. A sketch of the idea follows.
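A minimal sketch of what such a script would do on each CN node (the fixed name wk1 and the address are this article's lab values; the surrounding role-detection logic is left out):

# Re-point the fixed worker hostname at the current worker primary.
# On the CN standby, substitute the worker standby's IP instead.
sudo sed -i '/[[:space:]]wk1$/d' /etc/hosts
echo '192.168.234.203 wk1' | sudo tee -a /etc/hosts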
Method 2: bind a read-write VIP and a read-only VIP dynamically on the workers; in the Citus metadata, register the read-write VIP as the primary-role worker and the read-only VIP as the secondary-role worker.
For how Patroni binds a VIP dynamically, see 基於Patroni的PostgreSQL高可用環境部署.md. For Citus workers, the read-write VIP is bound by a Patroni callback script, and the read-only VIP by keepalived.
The configuration below follows method 2.
When creating the Citus cluster, add wk1's read-write VIP (192.168.234.210) and read-only VIP (192.168.234.211) on the CN primary, as the primary worker and the secondary worker respectively, both with groupid 1:
SELECT * from master_add_node('192.168.234.210', 5432, 1, 'primary');
SELECT * from master_add_node('192.168.234.211', 5432, 1, 'secondary');
To make the CN standby connect to secondary workers, the following parameter must also be set on the CN standby:
alter system set citus.use_secondary_nodes=always;
select pg_reload_conf();
The change only affects newly created sessions; to make it effective immediately, terminate the existing sessions after changing the parameter, for example as follows.
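One way to do that, using the same filter the callback script below applies (it spares Patroni's own connections and the current session):

psql -Atc "select pg_terminate_backend(pid) from pg_stat_activity where backend_type='client backend' and application_name <> 'Patroni' and pid <> pg_backend_pid()"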
Now run the same SQL on the CN primary and on the CN standby; it is routed to different workers.
On the CN primary (citus.use_secondary_nodes=always not set):
postgres=# explain select * from tb1;
QUERY PLAN
-------------------------------------------------------------------------------
Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=36)
Task Count: 32
Tasks Shown: One of 32
-> Task
Node: host=192.168.234.210 port=5432 dbname=postgres
-> Seq Scan on tb1_102168 tb1 (cost=0.00..22.70 rows=1270 width=36)
(6 rows)
On the CN standby (citus.use_secondary_nodes=always set):
postgres=# explain select * from tb1;
QUERY PLAN
-------------------------------------------------------------------------------
Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=36)
Task Count: 32
Tasks Shown: One of 32
-> Task
Node: host=192.168.234.211 port=5432 dbname=postgres
-> Seq Scan on tb1_102168 tb1 (cost=0.00..22.70 rows=1270 width=36)
(6 rows)
Since the CN itself can also fail over, citus.use_secondary_nodes must be adjusted dynamically. This can be done with a Patroni callback script.
Create /pgsql/switch_use_secondary_nodes.sh to switch the parameter:
#!/bin/bash

DBNAME=postgres
KILL_ALL_SQL="select pg_terminate_backend(pid) from pg_stat_activity where backend_type='client backend' and application_name <> 'Patroni' and pid <> pg_backend_pid()"

action=$1
role=$2
cluster=$3

log()
{
    echo "switch_use_secondary_nodes: $*" | logger
}

alter_use_secondary_nodes()
{
    value="$1"

    oldvalue=`psql -d postgres -Atc "show citus.use_secondary_nodes"`
    if [ "$value" = "$oldvalue" ] ; then
        log "old value of use_secondary_nodes already be '${value}', skip change"
        return
    fi

    psql -d ${DBNAME} -c "alter system set citus.use_secondary_nodes=${value}" >/dev/null
    rc=$?
    if [ $rc -ne 0 ] ; then
        log "fail to alter use_secondary_nodes to '${value}' rc=$rc"
        exit 1
    fi

    psql -d ${DBNAME} -c 'select pg_reload_conf()' >/dev/null
    rc=$?
    if [ $rc -ne 0 ] ; then
        log "fail to call pg_reload_conf() rc=$rc"
        exit 1
    fi

    log "changed use_secondary_nodes to '${value}'"

    ## kill all existing connections
    killed_conns=`psql -d ${DBNAME} -Atc "${KILL_ALL_SQL}" | wc -l`
    rc=$?
    if [ $rc -ne 0 ] ; then
        log "failed to kill connections rc=$rc"
        exit 1
    fi
    log "killed ${killed_conns} connections"
}

log "switch_use_secondary_nodes start args:'$*'"

case $action in
    on_start|on_restart|on_role_change)
        case $role in
            master)
                alter_use_secondary_nodes never
                ;;
            replica)
                alter_use_secondary_nodes always
                ;;
            *)
                log "wrong role '$role'"
                exit 1
                ;;
        esac
        ;;
    *)
        log "wrong action '$action'"
        exit 1
        ;;
esac
Modify the Patroni configuration file /etc/patroni.yml to register the callbacks:
postgresql:
...
callbacks:
on_start: /bin/bash /pgsql/switch_use_secondary_nodes.sh
on_restart: /bin/bash /pgsql/switch_use_secondary_nodes.sh
on_role_change: /bin/bash /pgsql/switch_use_secondary_nodes.sh
After the Patroni configuration files on all CN nodes have been modified, reload the configuration:
patronictl reload cn
After a switchover on the CN, you can see the use_secondary_nodes parameter change:
/var/log/messages:
Sep 10 00:10:25 node2 postgres: switch_use_secondary_nodes: switch_use_secondary_nodes start args:'on_role_change replica cn'
Sep 10 00:10:25 node2 postgres: switch_use_secondary_nodes: changed use_secondary_nodes to 'always'
Sep 10 00:10:25 node2 postgres: switch_use_secondary_nodes: killed 0 connections
9. References
- 基於Patroni的PostgreSQL高可用環境部署.md (companion article: Patroni-based PostgreSQL HA deployment)
- 《基於Patroni的Citus高可用方案》 (Patroni-based Citus HA, a talk at the PostgreSQL China Conference 2019)
- https://patroni.readthedocs.io/en/latest/
- http://blogs.sungeek.net/unixwiz/2018/09/02/centos-7-postgresql-10-patroni/
- https://scalegrid.io/blog/managing-high-availability-in-postgresql-part-1/
- https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters
- https://www.percona.com/blog/2019/10/23/seamless-application-failover-using-libpq-features-in-postgresql/