Fixing a bit flip in a ClickHouse data file
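I recently noticed that my home server was drawing more power than usual. After SSHing in, I found the clickhouse-server process pegging one CPU core.
This database mainly stores log data, with clients writing to it continuously. The error log showed an error like the following every few seconds: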
DB::Exception: Checksum doesn't match: corrupted data. Reference: 951421a0138404e940e43ab1de8f99c5. Actual: 181ec74e7108c3d94b962d0a2c26c3c1. Size of compressed block: 7762. The mismatch is caused by single bit flip in data block at byte 4957, bit 5.
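So it looked as if a single bit in one of the data files had flipped. A Google search turned up this project (the clickhouse-bitflip tool used below), so I downloaded it to give it a try. The log only named the directory containing the damaged file, not the file itself, so I made a backup and ran the tool on each .bin file in turn. One file produced the following output: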
# ~/clickhouse-bitflip ./var/lib/clickhouse/store/95b/95b401b2-bfe9-49d2-90ff-a2ea7998f124/202402_1341974_1343016_729/value.bin
Checksum doesn't match: corrupted data. Reference: 951421a0138404e940e43ab1de8f99c5. Actual: 181ec74e7108c3d94b962d0a2c26c3c1. Size of compressed block: 7762
The mismatch is caused by single bit flip in data block at byte 4957 , bit 5
backup file to ./var/lib/clickhouse/store/95b/95b401b2-bfe9-49d2-90ff-a2ea7998f124/202402_1341974_1343016_729/value.bin.bak
Fixing in file at byte 645038. Old value: 0xb9. New value: 0x99
Completed. Corrected errors: 1 . Uncorrected errors 0
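The back-up-then-try-each-file step can be scripted. The following is only a rough sketch of how I might automate it, not part of the tool: the function names and the separate backup directory are my own assumptions, and the only thing taken from the session above is that the tool is invoked with a single file path as its argument.

```python
import shutil
import subprocess
from pathlib import Path


def backup_bin_files(part_dir, backup_dir):
    """Copy every .bin file in part_dir into backup_dir; return the originals."""
    part_dir, backup_dir = Path(part_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    bin_files = sorted(part_dir.glob("*.bin"))
    for f in bin_files:
        # copy2 also preserves timestamps and permission bits
        shutil.copy2(f, backup_dir / f.name)
    return bin_files


def try_each_file(bin_files, tool):
    """Run the repair tool (a command given as a list) on each file.

    Returns a dict mapping file name to the tool's combined output,
    so the one file that reports a fix is easy to spot.
    """
    results = {}
    for f in bin_files:
        proc = subprocess.run([*tool, str(f)], capture_output=True, text=True)
        results[f.name] = proc.stdout + proc.stderr
    return results
```

With the part directory from the output above, this would be called roughly as try_each_file(backup_bin_files(part_dir, "/some/backup/dir"), ["/path/to/clickhouse-bitflip"]), where both paths are placeholders. Keeping the copies in a separate directory avoids clashing with the .bak file the tool writes next to the original.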
One byte had flipped from 1001 1001 (0x99) to 1011 1001 (0xb9). Once it was changed back, the checksum matched again.
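To double-check the numbers, and to illustrate how a single flipped bit can be located at all, here is a small sketch. It is not the tool's actual code. As far as I know ClickHouse checksums compressed blocks with CityHash128, which the Python standard library does not provide, so SHA-256 stands in as the checksum here; flip_bit and find_single_bit_flip are hypothetical helper names.

```python
import hashlib

# The byte values reported by the tool: 0xb9 on disk, 0x99 after the fix.
old, new = 0xB9, 0x99
diff = old ^ new
assert diff == 0x20 == 1 << 5          # exactly bit 5 differs
assert bin(diff).count("1") == 1       # and it is the only differing bit


def flip_bit(data, byte_index, bit):
    """Return a copy of data with one bit toggled."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)


def sha256_digest(data):
    """Stand-in checksum (assumption: the real one is CityHash128)."""
    return hashlib.sha256(data).digest()


def find_single_bit_flip(data, expected_digest, digest=sha256_digest):
    """Brute force: toggle each bit in turn until the checksum matches.

    Returns (byte_index, bit) of the flipped bit, or None if no single
    bit flip explains the mismatch.
    """
    for byte_index in range(len(data)):
        for bit in range(8):
            if digest(flip_bit(data, byte_index, bit)) == expected_digest:
                return byte_index, bit
    return None


# Small demonstration on made-up data.
good = bytes(range(64))
expected = sha256_digest(good)
bad = flip_bit(good, 40, 5)            # simulate the corruption
location = find_single_bit_flip(bad, expected)
assert location == (40, 5)
assert flip_bit(bad, *location) == good
```

For a block of 7762 bytes this means at most 7762 × 8 = 62096 checksum computations, which is cheap enough to try on every mismatch.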
After I restarted the ClickHouse server, the errors stopped and CPU usage returned to normal. Problem solved.