Here's what happens when a node repair is requested (nodetool repair).
Definition:
Neighbors: the nodes that hold replicas of the same data as a given node. For example, suppose you have a 7-node cluster with tokens evenly distributed (that is, each node owns 1/7 of the ring) and a replication factor of 3. Every piece of data you write then has 3 replicas. Under the default strategy (SimpleStrategy), the two additional replicas are placed on the next two nodes along the ring. So if the data you write belongs to node 4, nodes 5 and 6 each hold a replica as well. Node 4 in turn holds one replica of node 3's data and one of node 2's. The neighbors of node 4 are therefore nodes 2, 3, 4, 5 and 6. By the same logic, node 6's neighbors are 4, 5, 6, 7 and 1 (remember it's a ring: when you reach the end, the next node is the first one).
You can do the same calculation for any other node or replication factor.
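The neighbor calculation above can be sketched in a few lines of Python. This is only an illustration of the ring arithmetic, assuming evenly spaced tokens, nodes numbered from 1, and SimpleStrategy-style placement on consecutive nodes; the function names `replica_nodes` and `neighbors` are made up for this example and are not part of Cassandra.

```python
def replica_nodes(owner, num_nodes, rf):
    """Nodes holding data whose first replica lives on `owner` (1-based ring)."""
    return [((owner - 1 + i) % num_nodes) + 1 for i in range(rf)]


def neighbors(node, num_nodes, rf):
    """All nodes sharing at least one replicated range with `node`, itself included."""
    result = set()
    for owner in range(1, num_nodes + 1):
        replicas = replica_nodes(owner, num_nodes, rf)
        if node in replicas:
            # Every node in this replica set shares data with `node`.
            result.update(replicas)
    return sorted(result)


if __name__ == "__main__":
    print(neighbors(4, 7, 3))  # [2, 3, 4, 5, 6]
    print(neighbors(6, 7, 3))  # [1, 4, 5, 6, 7]
```

With 7 nodes and a replication factor of 3 this gives the same two neighbor sets as the worked example, listed in sorted order.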
Steps:
For each keyspace in the Cassandra database, do the following:
  Skip it if it's a system keyspace.
  Run a forced table repair on the keyspace:
    Make sure all neighbors are up, or quit.
    Send a build-hash-tree request to all neighbors (at the same time). <--- see below for the hash tree definition
    On receiving the request, each node does the following for each column family in the keyspace:
      Trigger a read-only compaction by flushing memtables, <--- possibly huge physical write
      build the hash tree by reading all rows, <--- possibly huge physical read
      and send the hash tree back to the requesting node.
    After receiving the hash trees from all neighbors, the requesting node compares each one (per column family) with its local result
    and, if they differ, asks the remote node for SSTables (data files) to repair from (compare all rows, update local rows with the most recently updated rows). <--- possibly huge physical read on the remote node and read/write on the local node
  Wait until the repair finishes or fails, then go on to the next keyspace.
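To make the control flow of these steps concrete, here is a toy in-memory simulation in Python. It is a sketch under heavy assumptions, not how Cassandra implements repair: a single SHA-256 digest per column family stands in for the hash tree, plain dicts stand in for SSTables, and `digest`, `repair_keyspace` and the node names are invented for this example.

```python
import hashlib


def digest(rows):
    """Hash a column family's rows ({key: (value, timestamp)}) in key order."""
    h = hashlib.sha256()
    for key in sorted(rows):
        value, ts = rows[key]
        h.update(f"{key}={value}@{ts};".encode())
    return h.hexdigest()


def repair_keyspace(local, neighbors, up):
    """Repair `local` ({column_family: rows}) against each neighbor's copy.

    `neighbors` maps node name -> {column_family: rows}; `up` is the set of
    live node names. Returns the number of local rows that were updated.
    """
    # Step: make sure all neighbors are up, or quit.
    if not all(name in up for name in neighbors):
        raise RuntimeError("a neighbor is down; repair aborted")
    updated = 0
    for name, remote in neighbors.items():
        for cf, remote_rows in remote.items():
            local_rows = local.setdefault(cf, {})
            # Step: compare the neighbor's result with the local one.
            if digest(local_rows) == digest(remote_rows):
                continue  # in sync, nothing to transfer
            # Step: pull the remote rows and keep the most recent version.
            for key, (value, ts) in remote_rows.items():
                if key not in local_rows or local_rows[key][1] < ts:
                    local_rows[key] = (value, ts)
                    updated += 1
    return updated
```

Even in this toy version, every comparison requires reading all rows on both sides, which is where the heavy physical reads noted in the steps come from.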
So if your write consistency level is set to ALL, or you never delete any records, you don't have to run node repair at all. (If you don't delete, inserted or updated data is synced when you access it, which is called read repair: http://wiki.apache.org/cassandra/ReadRepair )
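The idea behind read repair can be sketched as follows. This is a simplified illustration, assuming each replica is a plain dict of `{key: (value, timestamp)}` and that the newest timestamp wins; `read_with_repair` is a hypothetical helper for this example, not a Cassandra API.

```python
def read_with_repair(replicas, key):
    """Read `key` from every replica, return the newest value, and
    overwrite stale or missing copies with it.

    Each replica is a dict of {key: (value, timestamp)}.
    """
    versions = [r[key] for r in replicas if key in r]
    if not versions:
        return None  # no replica has the key
    newest = max(versions, key=lambda v: v[1])
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest  # bring the stale replica up to date
    return newest[0]
```

Only keys that are actually read get synced this way, so data that is never accessed stays out of sync until a repair runs.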
A hash tree is how Cassandra efficiently determines which parts of the data are out of sync between nodes. You can find more here:
http://en.wikipedia.org/wiki/Hash_tree
and
http://wiki.apache.org/cassandra/AntiEntropy
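As a rough illustration of why a hash tree makes the comparison cheap, here is a minimal Python sketch. It assumes each leaf is the serialized content of one token range, that both trees have the same power-of-two number of leaves, and it uses SHA-256; the helpers `build_tree` and `diff_ranges` are invented for this example and do not reflect Cassandra's actual implementation.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return the tree as a list of levels: leaf hashes first, root last.

    `leaves` is a list of bytes, one item per token range; its length
    must be a power of two.
    """
    level = [_h(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        # Each parent is the hash of its two children concatenated.
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff_ranges(a, b):
    """Indexes of leaf ranges that differ between two same-shaped trees,
    descending only into subtrees whose hashes do not match."""
    top = len(a) - 1
    if a[top][0] == b[top][0]:
        return []  # roots match: everything is in sync
    suspects = [0]
    for depth in range(top, 0, -1):
        suspects = [c for i in suspects for c in (2 * i, 2 * i + 1)
                    if a[depth - 1][c] != b[depth - 1][c]]
    return suspects
```

For example, if two nodes each build a tree over four ranges and only the third range differs, `diff_ranges` returns `[2]`, so only that range needs to be exchanged rather than the whole data set.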
You can observe node repair progress by setting the logging level to DEBUG (in log4j.properties if you're using the default log4j) for the class org.apache.cassandra.service.AntiEntropyService, and observe compaction progress by setting the logging level to DEBUG for the class org.apache.cassandra.db.compaction.CompactionManagerMBean.