Fixing a ClickHouse cluster node that fails to start after a hardware failure
July 18, 2025 · clickhouse

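Problem description:

In a 5-node production ClickHouse cluster, a faulty memory module on one node caused the server to reboot unexpectedly. After the memory was replaced and the server booted, the clickhouse-server service would not start, leaving that node down.

Troubleshooting and fix:

1. Log in to the failed node and enable ClickHouse logging

vi /etc/clickhouse-server/config.xml #the two entries below belong in the <logger> section

<log>/data/server/clickhouse/log/clickhouse-server.log</log> #main log file: the ClickHouse runtime log (startup, queries, table loading, and so on)

<errorlog>/data/server/clickhouse/log/clickhouse-server.err.log</errorlog> #error log file: serious errors, exception stack traces, and so on

:wq! #save and exit

chmod 755 /etc/clickhouse-server -R #set permissions

2. Restart the service and check the latest log entries

systemctl stop clickhouse-server #stop the service

systemctl start clickhouse-server #start the service

tail -n 200 /data/server/clickhouse/log/clickhouse-server.err.log #show the last 200 lines of the error log

#The log output: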

7. ./build_docker/./src/Storages/MergeTree/MergeTreeData.cpp:1440: void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<DB::MergeTreeData::loadDataPartsFromDisk(ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>&, unsigned long, std::queue<std::vector<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>, std::allocator<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>>>, std::deque<std::vector<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>, std::allocator<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>>>, std::allocator<std::vector<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>, std::allocator<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>>>>>>&, std::shared_ptr<DB::MergeTreeSettings const> const&)::$_19, void ()>>(std::__function::__policy_storage const*) @ 0x14594b44 in /usr/lib/debug/usr/bin/clickhouse.debug

8. ./build_docker/./base/base/../base/wide_integer_impl.h:789: ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>) @ 0xe2b5625 in /usr/lib/debug/usr/bin/clickhouse.debug

9. ./build_docker/./src/Common/ThreadPool.cpp:0: void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::function<void ()>, long, std::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0xe2b8195 in /usr/lib/debug/usr/bin/clickhouse.debug

10. ./build_docker/./base/base/../base/wide_integer_impl.h:789: ThreadPoolImpl<std::thread>::worker(std::__list_iterator<std::thread, void*>) @ 0xe2b13f3 in /usr/lib/debug/usr/bin/clickhouse.debug

11. ./build_docker/./contrib/llvm-project/libcxx/include/__memory/unique_ptr.h:302: void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, long, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0xe2b7061 in /usr/lib/debug/usr/bin/clickhouse.debug

12. ? @ 0x8f1b in /usr/lib64/libpthread-2.28.so

13. clone @ 0xf833f in /usr/lib64/libc-2.28.so

(version 23.3.2.37 (official build))

2025.07.17 22:12:52.094227 [ 2124980 ] {} <Error> autoops_workbench.inspection_hardware_exec_host_history_distributed (92f2f288-f098-4e5e-acc3-ad4dede208c4): Detaching broken part /data/server/clickhouse/store/92f/92f2f288-f098-4e5e-acc3-ad4dede208c4/20250703_66576_66576_0 (size: 0.00 B). If it happened after update, it is likely because of backward incompatibility. You need to resolve this manually

2025.07.17 22:12:52.094396 [ 2124980 ] {} <Error> autoops_workbench.inspection_hardware_exec_host_history_distributed (92f2f288-f098-4e5e-acc3-ad4dede208c4): while loading part 20250703_66575_66575_0 on path store/92f/92f2f288-f098-4e5e-acc3-ad4dede208c4/20250703_66575_66575_0: Code: 27. DB::ParsingException: Cannot parse input: expected 'columns format version: 1\n' at end of stream. (CANNOT_PARSE_INPUT_ASSERTION_FAILED), Stack trace (when copying this message, always include the lines below):

0. ./build_docker/./src/Common/Exception.cpp:91: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xe1e20b5 in /usr/lib/debug/usr/bin/clickhouse.debug

1. ./build_docker/./contrib/llvm-project/libcxx/include/string:1499: DB::ParsingException::ParsingException<String&>(int, FormatStringHelperImpl<std::type_identity<String&>::type>, String&) @ 0xe2407c4 in /usr/lib/debug/usr/bin/clickhouse.debug

2. ./build_docker/./src/IO/ReadHelpers.cpp:103: DB::throwAtAssertionFailed(char const*, DB::ReadBuffer&) @ 0xe2406c1 in /usr/lib/debug/usr/bin/clickhouse.debug

3. ./build_docker/./src/IO/ReadBuffer.h:68: DB::NamesAndTypesList::readText(DB::ReadBuffer&) @ 0x127e5438 in /usr/lib/debug/usr/bin/clickhouse.debug

4. ./build_docker/./contrib/llvm-project/libcxx/include/list:621: DB::IMergeTreeDataPart::loadColumns(bool) @ 0x1448a338 in /usr/lib/debug/usr/bin/clickhouse.debug

5. ./build_docker/./src/Storages/MergeTree/IMergeTreeDataPart.cpp:623: DB::IMergeTreeDataPart::loadColumnsChecksumsIndexes(bool, bool) @ 0x14489c6a in /usr/lib/debug/usr/bin/clickhouse.debug

6. ./build_docker/./src/Storages/MergeTree/MergeTreeData.cpp:0: DB::MergeTreeData::loadDataPart(DB::MergeTreePartInfo const&, String const&, std::shared_ptr<DB::IDisk> const&, DB::MergeTreeDataPartState, std::mutex&) @ 0x14507df5 in /usr/lib/debug/usr/bin/clickhouse.debug

7. ./build_docker/./src/Storages/MergeTree/MergeTreeData.cpp:1440: void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<DB::MergeTreeData::loadDataPartsFromDisk(ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>&, unsigned long, std::queue<std::vector<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>, std::allocator<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>>>, std::deque<std::vector<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>, std::allocator<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>>>, std::allocator<std::vector<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>, std::allocator<std::shared_ptr<DB::MergeTreeData::PartLoadingTree::Node>>>>>>&, std::shared_ptr<DB::MergeTreeSettings const> const&)::$_19, void ()>>(std::__function::__policy_storage const*) @ 0x14594b44 in /usr/lib/debug/usr/bin/clickhouse.debug

8. ./build_docker/./base/base/../base/wide_integer_impl.h:789: ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>) @ 0xe2b5625 in /usr/lib/debug/usr/bin/clickhouse.debug

9. ./build_docker/./src/Common/ThreadPool.cpp:0: void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::function<void ()>, long, std::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0xe2b8195 in /usr/lib/debug/usr/bin/clickhouse.debug

10. ./build_docker/./base/base/../base/wide_integer_impl.h:789: ThreadPoolImpl<std::thread>::worker(std::__list_iterator<std::thread, void*>) @ 0xe2b13f3 in /usr/lib/debug/usr/bin/clickhouse.debug

11. ./build_docker/./contrib/llvm-project/libcxx/include/__memory/unique_ptr.h:302: void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, long, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0xe2b7061 in /usr/lib/debug/usr/bin/clickhouse.debug

12. ? @ 0x8f1b in /usr/lib64/libpthread-2.28.so

13. clone @ 0xf833f in /usr/lib64/libc-2.28.so

(version 23.3.2.37 (official build))

2025.07.17 22:12:52.094516 [ 2124980 ] {} <Error> autoops_workbench.inspection_hardware_exec_host_history_distributed (92f2f288-f098-4e5e-acc3-ad4dede208c4): Detaching broken part /data/server/clickhouse/store/92f/92f2f288-f098-4e5e-acc3-ad4dede208c4/20250703_66575_66575_0 (size: 0.00 B). If it happened after update, it is likely because of backward incompatibility. You need to resolve this manually

2025.07.17 22:12:52.201580 [ 2123926 ] {} <Error> Application: Caught exception while loading metadata: Code: 231. DB::Exception: Suspiciously many (167 parts, 0.00 B in total) broken parts to remove while maximum allowed broken parts count is 100. You can change the maximum value with merge tree setting 'max_suspicious_broken_parts' in <merge_tree> configuration section or in table settings in .sql file (don't forget to return setting back to default value): Cannot attach table `autoops_workbench`.`inspection_hardware_chart_history_distributed` from metadata file /data/server/clickhouse/store/9dc/9dc2573a-8292-4472-8e7f-8fe17c7e9b0b/inspection_hardware_chart_history_distributed.sql from query ATTACH TABLE autoops_workbench.inspection_hardware_chart_history_distributed UUID '7b6c30a1-91cb-458b-ae03-5acda452c66a' (`host_id` String, `exec_id` String, `insp_type` String, `insp_item` String, `insp_item_name` String, `insp_item_status` String, `insp_item_value` String, `insp_item_unit` String, `insp_datetime` DateTime, `insp_date` Date, `pool_id` String) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/inspection_hardware_chart_history_distributed', '{replica}') PARTITION BY toYYYYMM(insp_datetime) ORDER BY (host_id, insp_datetime) SETTINGS index_granularity = 8192. (TOO_MANY_UNEXPECTED_DATA_PARTS), Stack trace (when copying this message, always include the lines below):

0. ./build_docker/./src/Common/Exception.cpp:91: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xe1e20b5 in /usr/lib/debug/usr/bin/clickhouse.debug

1. ./build_docker/./contrib/llvm-project/libcxx/include/string:1499: DB::Exception::Exception<unsigned long&, String, DB::SettingFieldNumber<unsigned long> const&>(int, FormatStringHelperImpl<std::type_identity<unsigned long&>::type, std::type_identity<String>::type, std::type_identity<DB::SettingFieldNumber<unsigned long> const&>::type>, unsigned long&, String&&, DB::SettingFieldNumber<unsigned long> const&) @ 0x14514c7c in /usr/lib/debug/usr/bin/clickhouse.debug

2. ./build_docker/./src/Storages/MergeTree/MergeTreeData.cpp:0: DB::MergeTreeData::loadDataParts(bool) @ 0x145137a2 in /usr/lib/debug/usr/bin/clickhouse.debug

3. ./build_docker/./src/Storages/StorageReplicatedMergeTree.cpp:0: DB::StorageReplicatedMergeTree::StorageReplicatedMergeTree(String const&, String const&, bool, DB::StorageID const&, String const&, DB::StorageInMemoryMetadata const&, std::shared_ptr<DB::Context>, String const&, DB::MergeTreeData::MergingParams const&, std::unique_ptr<DB::MergeTreeSettings, std::defa

#The key lines in the log are:

2025.07.17 22:12:52.094516 [ 2124980 ] {} <Error> autoops_workbench.inspection_hardware_exec_host_history_distributed (92f2f288-f098-4e5e-acc3-ad4dede208c4): Detaching broken part /data/server/clickhouse/store/92f/92f2f288-f098-4e5e-acc3-ad4dede208c4/20250703_66575_66575_0 (size: 0.00 B). If it happened after update, it is likely because of backward incompatibility. You need to resolve this manually

2025.07.17 22:12:52.201580 [ 2123926 ] {} <Error> Application: Caught exception while loading metadata: Code: 231. DB::Exception: Suspiciously many (167 parts, 0.00 B in total) broken parts to remove while maximum allowed broken parts count is 100. You can change the maximum value with merge tree setting 'max_suspicious_broken_parts' in <merge_tree> configuration section or in table settings in .sql file (don't forget to return setting back to default value): Cannot attach table `autoops_workbench`.`inspection_hardware_chart_history_distributed` from metadata file /data/server/clickhouse/store/9dc/9dc2573a-8292-4472-8e7f-8fe17c7e9b0b/inspection_hardware_chart_history_distributed.sql from query ATTACH TABLE autoops_workbench.inspection_hardware_chart_history_distributed UUID '7b6c30a1-91cb-458b-ae03-5acda452c66a' (`host_id` String, `exec_id` String, `insp_type` String, `insp_item` String, `insp_item_name` String, `insp_item_status` String, `insp_item_value` String, `insp_item_unit` String, `insp_datetime` DateTime, `insp_date` Date, `pool_id` String) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/inspection_hardware_chart_history_distributed', '{replica}') PARTITION BY toYYYYMM(insp_datetime) ORDER BY (host_id, insp_datetime) SETTINGS index_granularity = 8192. (TOO_MANY_UNEXPECTED_DATA_PARTS), Stack trace (when copying this message, always include the lines below):

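3. Log analysis

3.1 ClickHouse detected a broken part at:

/data/server/clickhouse/store/92f/92f2f288-f098-4e5e-acc3-ad4dede208c4/20250703_66575_66575_0

The part is 0 bytes in size, so its data is incomplete or corrupt. The ParsingException in the log confirms this: while loading the part's columns, the parser reached end of stream where it expected the 'columns format version: 1' header. A single part like this does not block startup by itself; the startup failure comes from how many such parts there are (see 3.2).

ClickHouse automatically detaches the part but does not delete it.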
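To see how many parts on disk are in the same state, the table's data directory can be scanned for part directories that contain no non-empty file. This is only a minimal sketch: the function name `list_empty_parts` is made up for this example, and it assumes that "broken" here means "every file in the part is 0 bytes", which matches the `size: 0.00 B` in the log but is not how ClickHouse itself validates a part.

```shell
# Hypothetical helper: list part directories whose files are all empty.
# Argument: the table's data directory, for example
#   /data/server/clickhouse/store/92f/92f2f288-f098-4e5e-acc3-ad4dede208c4
list_empty_parts() {
    local table_dir="$1" part
    for part in "$table_dir"/*/; do
        [ -d "$part" ] || continue
        part="${part%/}"
        # Skip the directory where ClickHouse keeps detached parts.
        [ "$(basename "$part")" = "detached" ] && continue
        # Treat the part as empty when no file in it is larger than 0 bytes.
        if [ -z "$(find "$part" -type f -size +0c -print | head -n 1)" ]; then
            echo "$part"
        fi
    done
}
```

Piping the output through `wc -l` gives a count that can be compared with the number reported in the error log.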

3.2 Too many broken parts

2025.07.17 22:12:52.201580 [ 2123926 ] {} <Error> Application: Caught exception while loading metadata: Code: 231. DB::Exception: Suspiciously many (167 parts, 0.00 B in total) broken parts to remove while maximum allowed broken parts count is 100. You can change the maximum value with merge tree setting 'max_suspicious_broken_parts' in <merge_tree> configuration section or in table settings in .sql file (don't forget to return setting back to default value)

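While loading the table autoops_workbench.inspection_hardware_chart_history_distributed, ClickHouse found 167 broken parts.

By default at most 100 broken parts are allowed (the max_suspicious_broken_parts setting named in the message). Beyond that limit ClickHouse refuses to load the table, and the server fails to start.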
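The two messages analysed above can be pulled out of a long error log without reading the stack traces. The following is a rough sketch that relies only on the message wording seen in this log (version 23.3.2.37); the function names are invented for this example, and other ClickHouse versions may phrase the messages differently.

```shell
# Hypothetical helper: count the distinct part paths that the log reports
# with "Detaching broken part".
count_detached_in_log() {
    local log_file="$1"
    grep -o 'Detaching broken part [^ ]*' "$log_file" | sort -u | wc -l | tr -d ' '
}

# Hypothetical helper: print each "Suspiciously many" summary, which carries
# the number of broken parts and the currently allowed maximum.
show_broken_part_limits() {
    local log_file="$1"
    grep -o 'Suspiciously many ([0-9]* parts[^)]*) broken parts to remove while maximum allowed broken parts count is [0-9]*' "$log_file"
}
```

For example, `show_broken_part_limits /data/server/clickhouse/log/clickhouse-server.err.log` shows at a glance how far above the limit the table is, which helps when choosing a new value for max_suspicious_broken_parts.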

4. Fix

4.1 Delete the broken part

rm -rf /data/server/clickhouse/store/92f/92f2f288-f098-4e5e-acc3-ad4dede208c4/20250703_66575_66575_0
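A more cautious variant of this step is to move the part out of the table directory instead of deleting it, so that it can still be inspected later. The sketch below is an assumption-laden alternative, not what was done here: the function name `quarantine_part` and any quarantine directory passed to it are hypothetical, and it should only be run while clickhouse-server is stopped.

```shell
# Hypothetical helper: move a part directory into a quarantine directory
# instead of removing it with rm -rf.
# Arguments: <part directory> <quarantine directory (created if missing)>
quarantine_part() {
    local part_dir="${1%/}" quarantine_dir="$2"
    # Refuse to act on anything that is not an existing directory.
    [ -d "$part_dir" ] || { echo "not a directory: $part_dir" >&2; return 1; }
    mkdir -p "$quarantine_dir" || return 1
    mv -- "$part_dir" "$quarantine_dir/" && echo "moved $part_dir -> $quarantine_dir/"
}
```

The function handles one part per call, so it can be called in a loop when several part directories are affected. Keeping the quarantine directory on the same filesystem makes the move a cheap rename.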

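4.2 Edit the configuration file to raise the allowed number of broken parts

The default is 100 and we have 167 broken parts, so raise the limit to 200, which is above 167.

Note: once the problem is resolved, restore the default of 100 so that large numbers of broken parts are not tolerated in the long term.

vi /etc/clickhouse-server/config.xml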

<merge_tree>

<max_suspicious_broken_parts>200</max_suspicious_broken_parts>

</merge_tree>

:wq! #save and exit

chmod 755 /etc/clickhouse-server -R #set permissions
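Since this setting has to be reverted later, one option is to keep it out of config.xml and put it in a drop-in file under /etc/clickhouse-server/config.d/, which ClickHouse merges into the main configuration at startup. The sketch below assumes that layout; the function name and the file name max_suspicious_broken_parts.xml are made up for this example, and very old releases expect `<yandex>` rather than `<clickhouse>` as the root element.

```shell
# Hypothetical helper: write a drop-in override that raises
# max_suspicious_broken_parts.
# Arguments: <config.d directory> <new limit>
write_broken_parts_override() {
    local config_d="$1" limit="$2"
    mkdir -p "$config_d" || return 1
    # The file name is arbitrary; any *.xml file in config.d is merged.
    cat > "$config_d/max_suspicious_broken_parts.xml" <<EOF
<clickhouse>
    <merge_tree>
        <max_suspicious_broken_parts>${limit}</max_suspicious_broken_parts>
    </merge_tree>
</clickhouse>
EOF
}
```

Usage would be `write_broken_parts_override /etc/clickhouse-server/config.d 200`. Returning to the default afterwards is then a matter of deleting that one file and restarting the service, and the file must be readable by the user that runs clickhouse-server.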

4.3 Restart the clickhouse service

systemctl restart clickhouse-server

The service starts successfully.

5. Verify the cluster

clickhouse-client --password #log in with the client

DESCRIBE TABLE autoops_workbench.mv_inspection_hardware_exec_host_history_distributed; # view the table schema

SELECT count(*) FROM autoops_workbench.mv_inspection_hardware_exec_host_history_distributed; # count the rows in the table

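6. Check whether the table uses replication (ReplicatedMergeTree):

SHOW CREATE TABLE autoops_workbench.mv_inspection_hardware_exec_host_history_distributed;

The cluster has 5 nodes with 5 replicas, so the part deleted above will be fetched back from the healthy replicas and there is no need to worry about data loss.

Because the tables use the replication mechanism (ReplicatedMergeTree), ClickHouse syncs the data from the other replicas in the background and keeps the replicas consistent.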
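The background sync can be observed through the system.replicas table, which reports for each replicated table whether the replica is read-only and how much work is still queued. A small sketch, assuming the query is run on the repaired node; the function name `replica_status_query` is invented here and it only builds the query text.

```shell
# Hypothetical helper: build a query that shows the replication state of
# one table on the local node.
# Arguments: <database> <table>
replica_status_query() {
    local db="$1" table="$2"
    printf "SELECT database, table, is_readonly, queue_size, absolute_delay, active_replicas, total_replicas FROM system.replicas WHERE database = '%s' AND table = '%s'\n" "$db" "$table"
}
```

It could be run as `clickhouse-client --password --query "$(replica_status_query autoops_workbench inspection_hardware_chart_history_distributed)"`. A queue_size that drains toward 0 with is_readonly = 0 suggests the replica has caught up, although this does not replace comparing row counts across replicas.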

This completes the repair of the ClickHouse cluster node that failed to start after a hardware failure.

     
