Keuin's

Repairing a bit flip in a ClickHouse database file

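Recently I noticed that my home server's power draw was a bit high. After SSHing in, I found the clickhouse-server process pinning one CPU core.

This database mainly stores log data, and clients write to it continuously. The error log showed an error like the following every few seconds: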

DB::Exception: Checksum doesn't match: corrupted data. Reference: 951421a0138404e940e43ab1de8f99c5. Actual: 181ec74e7108c3d94b962d0a2c26c3c1. Size of compressed block: 7762. The mismatch is caused by single bit flip in data block at byte 4957, bit 5.

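It looked like a single bit in a data file had flipped. A Google search turned up this project, so I downloaded it to give it a try. The log only identified the directory containing the bad file, not the file itself, so I made a backup and then tried the .bin files one by one. One of them produced the following output: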

# ~/clickhouse-bitflip ./var/lib/clickhouse/store/95b/95b401b2-bfe9-49d2-90ff-a2ea7998f124/202402_1341974_1343016_729/value.bin
Checksum doesn't match: corrupted data. Reference: 951421a0138404e940e43ab1de8f99c5. Actual: 181ec74e7108c3d94b962d0a2c26c3c1. Size of compressed block: 7762
    The mismatch is caused by single bit flip in data block at byte 4957 , bit 5
    backup file to ./var/lib/clickhouse/store/95b/95b401b2-bfe9-49d2-90ff-a2ea7998f124/202402_1341974_1343016_729/value.bin.bak
    Fixing in file at byte 645038. Old value: 0xb9. New value: 0x99
Completed. Corrected errors: 1 . Uncorrected errors 0
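
The back-up-then-try-each-file procedure can be sketched in Python as below. This is only an illustration under assumptions: the helper names `backup_part` and `try_repair` are hypothetical, and the only thing assumed about the tool is what the command above shows, namely that it takes the path of a .bin file as its single argument.

```python
import shutil
import subprocess
from pathlib import Path


def backup_part(part_dir: Path, backup_dir: Path) -> list[Path]:
    """Copy the whole part directory, then list its .bin files.

    shutil.copytree raises FileExistsError if backup_dir already exists,
    so an earlier backup is never silently overwritten.
    """
    shutil.copytree(part_dir, backup_dir)
    return sorted(part_dir.glob("*.bin"))


def try_repair(tool: Path, bin_file: Path) -> str:
    """Run the repair tool on one file and return everything it printed.

    Assumption: the tool is invoked as `<tool> <file>`, as in the
    command shown above. No other flags are assumed.
    """
    result = subprocess.run(
        [str(tool), str(bin_file)],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code is reported through the output
    )
    return result.stdout + result.stderr
```

Calling `try_repair` on each path returned by `backup_part` and reading the output by hand mirrors what was done here. The separate directory copy is deliberate: the tool writes its own .bak file next to the original, but an independent copy also protects the files it does not touch.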

One byte had flipped from 1001 1001 (0x99) to 1011 1001 (0xb9). Once it was changed back, the checksum matched again.
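
The arithmetic is easy to check: XOR-ing the old and new byte values leaves only the flipped bit set. The Python sketch below does that, and also shows one way a single flipped bit could be located by brute force: flip each bit in turn and recompute the checksum. It is not the tool's actual implementation. The function names are hypothetical, and `hashlib.blake2b` with a 16-byte digest is only a stand-in for the 128-bit checksum that appears in the log.

```python
import hashlib
from typing import Optional


def flipped_bits(old: int, new: int) -> list[int]:
    """Return the bit positions (0 = least significant) where two bytes differ."""
    diff = old ^ new
    return [bit for bit in range(8) if diff & (1 << bit)]


def checksum(data: bytes) -> bytes:
    # Stand-in 128-bit hash, chosen only for illustration.
    # It is NOT the hash function ClickHouse uses.
    return hashlib.blake2b(data, digest_size=16).digest()


def find_single_bit_flip(block: bytes, reference: bytes) -> Optional[tuple[int, int]]:
    """Find the (byte index, bit) whose flip makes the checksum match again.

    Returns None if no single-bit flip reproduces the reference checksum.
    """
    buf = bytearray(block)  # work on a copy; the caller's data is untouched
    for byte_index in range(len(buf)):
        for bit in range(8):
            buf[byte_index] ^= 1 << bit      # flip one bit
            if checksum(bytes(buf)) == reference:
                return byte_index, bit
            buf[byte_index] ^= 1 << bit      # restore it and move on
    return None
```

`flipped_bits(0xb9, 0x99)` returns `[5]`, which agrees with the "bit 5" in the log once bits are counted from the least significant one starting at 0. For a block of 7762 bytes the brute-force search needs at most 62,096 checksum computations, which is cheap.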

After I restarted the ClickHouse server, the errors stopped and CPU usage returned to normal. Problem solved.