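SimHash is an algorithm that derives a hash fingerprint from the content of a text, so that similar texts yield fingerprints differing in only a few bits. Implemented as a MySQL stored function, it can be used for text deduplication and similarity queries.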
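    DELIMITER //

    -- Requires MySQL 8.0+ for REGEXP_REPLACE.
    CREATE FUNCTION simhash(text TEXT) RETURNS BIGINT
    DETERMINISTIC
    BEGIN
        DECLARE words TEXT;
        DECLARE word TEXT;
        DECLARE stopwords TEXT;
        DECLARE hash BIGINT DEFAULT 0;
        DECLARE weight INT DEFAULT 1;
        DECLARE score INT;
        DECLARE bits INT DEFAULT 32;  -- CRC32 yields 32-bit values
        DECLARE b INT DEFAULT 0;
        DECLARE i INT;

        -- Normalize: non-alphanumeric runs become one space, then lower-case.
        SET words = LOWER(TRIM(REGEXP_REPLACE(text, '[^[:alnum:]]+', ' ')));
        -- Comma-separated so that FIND_IN_SET can search it.
        SET stopwords = CONCAT('a,an,and,are,as,at,be,by,for,from,had,he,i,in,is,it',
                               ',of,on,or,that,the,there,this,to,was,with');

        bitloop: WHILE b < bits DO
            SET score = 0;
            SET i = 1;
            wordloop: WHILE i <= CHAR_LENGTH(words) DO
                SET word = SUBSTRING_INDEX(SUBSTRING(words, i), ' ', 1);
                SET i = i + CHAR_LENGTH(word) + 1;
                -- Stop words get weight 0, all other words weight 1.
                SET weight = IF(FIND_IN_SET(word, stopwords) > 0, 0, 1);
                -- Vote +1 if bit b of the word's CRC32 is set, otherwise -1.
                SET score = score + weight * IF((CRC32(word) >> b) & 1, 1, -1);
            END WHILE wordloop;
            -- The fingerprint bit is 1 when the weighted votes are positive.
            IF score > 0 THEN
                SET hash = hash | (1 << b);
            END IF;
            SET b = b + 1;
        END WHILE bitloop;

        RETURN hash;
    END //

    DELIMITER ;

The simhash function works in two stages: tokenization and hash computation. The input text is first normalized inside the function: every run of non-alphanumeric characters is replaced with a single space by a regular expression, and the result is converted to lower case. The function then walks the string word by word, splitting on spaces, and checks each word against the stop-word list; stop words receive weight 0 and all other words weight 1. For each bit position, every word casts a weighted vote of +1 if that bit of its CRC32 value is set and -1 otherwise, and the corresponding bit of the simhash value is set to 1 when the vote total is positive. Because CRC32 produces 32-bit values, the resulting fingerprint is 32 bits wide. The word list is rescanned once per bit position, so the running time grows with 32 times the number of words.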
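As a cross-check outside the database, the usual per-bit voting form of SimHash can be sketched in Python. This is a minimal sketch under stated assumptions, not the author's implementation: it assumes CRC32 (via `zlib.crc32` on the UTF-8 bytes) as the per-word hash, which limits the fingerprint to 32 bits, and it simply skips stop words. The names `tokenize`, `simhash`, and `hamming_distance` are illustrative.

```python
import re
import zlib

# Stop-word list taken from the MySQL function.
STOPWORDS = frozenset(
    "a an and are as at be by for from had he i in is it "
    "of on or that the there this to was with".split()
)


def tokenize(text):
    """Lower-case the text and split it on runs of non-alphanumeric characters."""
    return [w for w in re.split(r"[\W_]+", text.lower()) if w]


def simhash(text, bits=32):
    """Return a `bits`-wide SimHash fingerprint using per-bit voting.

    Assumption: CRC32 is the per-word hash, so only 32 bits carry information.
    """
    votes = [0] * bits
    for word in tokenize(text):
        if word in STOPWORDS:
            continue  # stop words do not vote
        h = zlib.crc32(word.encode("utf-8"))
        for b in range(bits):
            # +1 if bit b of the word hash is set, otherwise -1
            votes[b] += 1 if (h >> b) & 1 else -1

    fingerprint = 0
    for b, total in enumerate(votes):
        if total > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two texts are then compared by the Hamming distance between their fingerprints, for example `hamming_distance(simhash(doc_a), simhash(doc_b))`; a small distance suggests near-duplicates. The cut-off that counts as "small" depends on the application and has to be chosen empirically.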