Background:
ClickHouse runs as a 5-node cluster. On node 10.103.128.198, the table network_config_backup in database ckdata_devops went into a read-only state, so the application could no longer write data and errors appeared in the logs.
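Before repairing, the read-only state can be confirmed from the client on the affected node. A minimal check (system.replicas lists every local replicated table):

```sql
-- Run on 10.103.128.198: list local replicated tables that are read-only,
-- along with the last ZooKeeper error if there was one
SELECT database, table, is_readonly, zookeeper_exception
FROM system.replicas
WHERE is_readonly = 1;
```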
Repair approach:
Because repairing the replica's metadata state directly is very difficult, the simplest and most reliable approach (and the one recommended by ClickHouse) is to drop and recreate the table on the broken node, then let the cluster resynchronize the data automatically.
Procedure:
1. Manually remove the replica node from ZooKeeper
The table is already in an abnormal state, but simply dropping and recreating it will fail, because the replica znode /replicas/10.103.128.198 still exists in ZooKeeper (it prevents duplicate registration).
Log in to any one node of the ZooKeeper cluster:
cd /data/server/zookeeper/bin
./zkCli.sh
#list all replicas of the table
ls /clickhouse/tables/01/network_config_backup/replicas
#remove the failed replica's znode; plain delete fails on non-empty znodes, so use deleteall (on ZooKeeper < 3.5 the command is rmr)
deleteall /clickhouse/tables/01/network_config_backup/replicas/10.103.128.198
#exit
quit
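On recent ClickHouse versions there is an alternative that avoids zkCli entirely: SYSTEM DROP REPLICA removes the failed replica's metadata from ZooKeeper. It must be run from a healthy node (the server refuses to drop its own local replica). A sketch, assuming your version supports the statement:

```sql
-- Run on a healthy node, e.g. 10.103.128.196
SYSTEM DROP REPLICA '10.103.128.198' FROM TABLE ckdata_devops.network_config_backup;
```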
2. On a healthy node (10.103.128.196)
Open the client console: clickhouse-client --password
show databases; #list databases
use ckdata_devops; #switch to the database
show tables; #list tables
#inspect the table definition
SHOW CREATE TABLE ckdata_devops.network_config_backup;
SHOW CREATE TABLE ckdata_devops.network_config_backup
Query id: 23b1f63e-7571-4001-959d-6f2d3b9977e0
┌─statement──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE ckdata_devops.network_config_backup
(
`devices_id` String,
`devices_name` String,
`resource_ip` String,
`config_url` String,
`upate_time` DateTime,
`version_code` String,
`scenario_type` Int32 DEFAULT 12,
`network_job_id` String,
`config_file_extension` String DEFAULT 'txt',
`pool_id` String,
`exec_id` Nullable(String),
`bucket_name` Nullable(String),
`is_minio` Nullable(String) DEFAULT false,
`is_base_line` Nullable(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/network_config_backup', '{replica}')
PARTITION BY toYYYYMMDD(upate_time)
PRIMARY KEY (devices_id, version_code, scenario_type, pool_id)
ORDER BY (devices_id, version_code, scenario_type, pool_id)
SETTINGS index_granularity = 8192 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
1 row in set. Elapsed: 0.002 sec.
This is the CREATE TABLE statement we need; it will be used later.
####################################################
CREATE TABLE ckdata_devops.network_config_backup
(
`devices_id` String,
`devices_name` String,
`resource_ip` String,
`config_url` String,
`upate_time` DateTime,
`version_code` String,
`scenario_type` Int32 DEFAULT 12,
`network_job_id` String,
`config_file_extension` String DEFAULT 'txt',
`pool_id` String,
`exec_id` Nullable(String),
`bucket_name` Nullable(String),
`is_minio` Nullable(String) DEFAULT false,
`is_base_line` Nullable(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/network_config_backup', '{replica}')
PARTITION BY toYYYYMMDD(upate_time)
PRIMARY KEY (devices_id, version_code, scenario_type, pool_id)
ORDER BY (devices_id, version_code, scenario_type, pool_id)
SETTINGS index_granularity = 8192;
####################################################
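Before recreating the table on the failed node, it is worth confirming that its {shard} and {replica} macros still expand to the expected values (01 and 10.103.128.198 here), since the new replica will register under the same ZooKeeper path:

```sql
-- Run on 10.103.128.198: show the configured macro substitutions
SELECT * FROM system.macros;
```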
3. On the failed node (10.103.128.198)
systemctl stop clickhouse-server #stop the service
#delete the broken table's data
rm -rf /data/server/clickhouse/data/ckdata_devops/network_config_backup/
#detached parts live inside the table directory (…/network_config_backup/detached/), so the command above already removes them; no separate cleanup is needed
#verify the table data directory is gone
ls -l /data/server/clickhouse/data/ckdata_devops/network_config_backup/
#check that the metadata file still exists
ls -l /data/server/clickhouse/metadata/ckdata_devops/network_config_backup.sql
systemctl start clickhouse-server #start the service
Open the client console: clickhouse-client --password
show databases; #list databases
use ckdata_devops; #switch to the database
show tables; #list tables
3.1 Drop the table
DROP TABLE ckdata_devops.network_config_backup;
3.2 Recreate the table with the statement saved earlier
Paste the CREATE TABLE statement captured in step 2 (the block between the #### markers) and execute it as-is.
3.3 Trigger synchronization
SYSTEM SYNC REPLICA ckdata_devops.network_config_backup;
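SYSTEM SYNC REPLICA waits until the replica has processed its replication queue, and may time out while backfill is still running. If it does, the remaining work can be inspected directly:

```sql
-- Pending replication tasks for this table, oldest first
SELECT type, create_time, num_tries, last_exception
FROM system.replication_queue
WHERE database = 'ckdata_devops'
  AND table = 'network_config_backup'
ORDER BY create_time
LIMIT 10;
```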
3.4 Check the local replica's status
SELECT
replica_name,
is_leader,
is_readonly,
is_session_expired,
queue_size,
parts_to_check,
inserts_in_queue
FROM system.replicas
WHERE table = 'network_config_backup'
AND database = 'ckdata_devops';
┌─replica_name───┬─is_leader─┬─is_readonly─┬─is_session_expired─┬─queue_size─┬─parts_to_check─┬─inserts_in_queue─┐
│ 10.103.128.198 │ 1 │ 0 │ 0 │ 0 │ 0 │ 0 │
└────────────────┴───────────┴─────────────┴────────────────────┴────────────┴────────────────┴──────────────────┘
parts_to_check = 0 ✅ healthy: the replication queue is empty and all data parts have been synchronized
3.5 Check the overall replica state
SELECT
is_readonly,
is_session_expired,
total_replicas,
active_replicas,
replica_name,
zookeeper_exception
FROM system.replicas
WHERE table = 'network_config_backup';
3.6 List the replicas registered in ZooKeeper
SELECT
    name AS registered_replica
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/network_config_backup/replicas';
┌─registered_replica─┐
│ 10.103.128.201 │
│ 10.103.128.198 │
│ 10.103.128.199 │
│ 10.103.128.196 │
│ 10.103.128.197 │
└────────────────────┘
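As a final sanity check, row counts can be compared across every replica in one query. Here 'your_cluster' is a placeholder, not a name from this environment; substitute the cluster name defined under remote_servers in the server config:

```sql
-- Counts should match across all replicas of the (single) shard
SELECT hostName() AS node, count() AS rows
FROM clusterAllReplicas('your_cluster', ckdata_devops, network_config_backup)
GROUP BY node;
```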
At this point, the ClickHouse cluster has been restored.



