Semantic duplicate detection in Python is a technique for comparing the similarity of texts, and it can be used to detect plagiarism. Several open-source Python tools help with this: Gensim for topic modeling and similarity queries, Jieba for Chinese word segmentation, and scikit-learn's implementation of TF-IDF weighting.
Below, we use TF-IDF as an example to show how to perform semantic duplicate detection in Python:
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tokenize the text with jieba and drop common stop words
def cut(content):
    stop_words = ["的", "了", "在", "是", "我", "有", "和", "就", "不", "人", "都", "一", "一個", "上", "也", "很", "到", "說", "要", "去", "你", "會", "著", "沒有", "看", "好", "自己", "這"]
    words = [i for i in jieba.cut(content) if i not in stop_words]
    return " ".join(words)

# Compute the similarity of two texts from their TF-IDF vectors
def similarity(content1, content2):
    sentences = [cut(content1), cut(content2)]
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(sentences)
    # cosine_similarity returns a 2x2 matrix; [0][1] is the cross-document score
    return cosine_similarity(vectors)[0][1]

if __name__ == "__main__":
    content1 = "今天天氣真好,我想去散步。"
    content2 = "天氣這么好,我想出去走走。"
    sim = similarity(content1, content2)
    print("Similarity between the two texts:", sim)
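One caveat worth knowing: TfidfVectorizer's default token_pattern, r"(?u)\b\w\w+\b", only keeps tokens of two or more characters, so single-character Chinese words produced by jieba are silently dropped. A minimal sketch of a workaround, reusing the cut() helper defined above and widening the token pattern (similarity_v2 is just an illustrative name, not a standard API):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_v2(content1, content2):
    sentences = [cut(content1), cut(content2)]  # cut() from the listing above
    # \w+ instead of the default \w\w+ keeps one-character words in the vocabulary
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
    vectors = vectorizer.fit_transform(sentences)
    return cosine_similarity(vectors)[0][1]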
Running this code prints the cosine similarity between "今天天氣真好,我想去散步。" and "天氣這么好,我想出去走走。". The score falls between 0 and 1, and the closer it is to 1, the more vocabulary the two texts share; a high score is a signal of possible plagiarism. Keep in mind that TF-IDF measures word overlap rather than meaning, so two sentences that paraphrase each other with mostly different words, as these two do, can receive a fairly low score even though a human would judge them similar.
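In practice, a duplicate checker rarely compares just two texts; it screens a batch of documents and flags suspicious pairs. A sketch of that extension, again reusing the cut() helper above (pairwise_similarity and the 0.8 threshold are illustrative assumptions, to be tuned per corpus):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_similarity(docs):
    corpus = [cut(d) for d in docs]             # cut() from the listing above
    vectors = TfidfVectorizer().fit_transform(corpus)
    return cosine_similarity(vectors)           # n x n matrix of pairwise scores

if __name__ == "__main__":
    docs = ["今天天氣真好,我想去散步。",
            "天氣這么好,我想出去走走。",
            "今天天氣真好,我想去散步。"]        # a verbatim copy, for contrast
    sims = pairwise_similarity(docs)
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if sims[i][j] > 0.8:                # example threshold, not a standard
                print(f"Documents {i} and {j} look alike: {sims[i][j]:.2f}")

With this input, only the verbatim copy (documents 0 and 2) clears the threshold, which matches the intuition that TF-IDF rewards exact word overlap.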