
Python semantic duplicate checking

江奕云, 2 years ago

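Semantic duplicate checking in Python is a technique for measuring how similar two texts are, and it can be used to detect plagiarism and similar cases. Python has many open-source tools that help with this, such as Gensim for text vectorization and Jieba for Chinese word segmentation, along with weighting schemes such as TF-IDF (available in scikit-learn).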

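Below we use TF-IDF as an example to show how to check for duplicates in Python. Strictly speaking, TF-IDF measures overlap in wording rather than meaning, but it is a common baseline: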

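import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Common Chinese stop words to drop before vectorizing
STOP_WORDS = {"的", "了", "在", "是", "我", "有", "和", "就", "不", "人", "都", "一", "一個", "上", "也", "很", "到", "說", "要", "去", "你", "會", "著", "沒有", "看", "好", "自己", "這"}


def cut(content):
    """Segment the text with jieba, drop stop words, and join the tokens with spaces."""
    words = [w for w in jieba.cut(content) if w not in STOP_WORDS]
    return " ".join(words)


def similarity(content1, content2):
    """Return the cosine similarity between the TF-IDF vectors of two texts."""
    sentences = [cut(content1), cut(content2)]
    # Note: TfidfVectorizer's default token_pattern keeps only tokens of two or more characters
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(sentences)
    # cosine_similarity returns a 2x2 matrix; entry [0][1] compares text 1 with text 2
    return cosine_similarity(vectors)[0][1]


if __name__ == "__main__":
    content1 = "今天天氣真好,我想去散步。"  # "The weather is really nice today, I want to go for a walk."
    content2 = "天氣這么好,我想出去走走。"  # "The weather is so nice, I want to go out for a stroll."
    sim = similarity(content1, content2)
    print("Similarity of the two texts:", sim)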

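With the code above, we obtained a similarity of 0.84 between "今天天氣真好,我想去散步。" and "天氣這么好,我想出去走走。". This suggests the two texts are fairly similar and may involve some copying. The exact score depends on how jieba segments the text and on the TfidfVectorizer settings, so your result may differ. A high score indicates overlapping wording and is a reason to review the texts manually, not proof of plagiarism.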
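
To make the score less of a black box, the sketch below recomputes a TF-IDF cosine similarity using only the Python standard library. It is an illustration under stated assumptions, not the scikit-learn implementation: the helper names (tfidf_vectors, cosine, flag_similar_pairs), the hand-written token lists, and the 0.8 threshold are all hypothetical. The IDF formula is the smoothed variant ln((1 + n) / (1 + df)) + 1 that scikit-learn documents as its default.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Return one {token: tf * idf} dict per tokenized document.

    Assumption: smoothed IDF, ln((1 + n) / (1 + df)) + 1, chosen to
    mirror scikit-learn's documented default.
    """
    n = len(token_lists)
    # Document frequency: in how many documents each token appears
    df = Counter(token for tokens in token_lists for token in set(tokens))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)


def flag_similar_pairs(token_lists, threshold=0.8):
    """Return (i, j, score) for every document pair scoring at or above threshold.

    The default threshold is an arbitrary illustrative value.
    """
    vectors = tfidf_vectors(token_lists)
    flagged = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                flagged.append((i, j, score))
    return flagged


if __name__ == "__main__":
    # Hypothetical hand-made tokens, not actual jieba output
    doc_a = ["今天", "天氣", "真好", "散步"]
    doc_b = ["天氣", "這么", "出去", "走走"]
    vec_a, vec_b = tfidf_vectors([doc_a, doc_b])
    print("Similarity:", cosine(vec_a, vec_b))
    print("Flagged pairs:", flag_similar_pairs([doc_a, doc_b, doc_a]))
```

Because the input here is already tokenized, single-character words are kept, unlike with TfidfVectorizer's default token pattern. Only tokens shared by both documents contribute to the score, so two sentences with the same meaning but different wording can still score low. The threshold should be tuned on real data.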