SimHash Results
Format
#matches.tsv (diff means simhash bit difference)
id1 id2 diff
5676323 8347653 4
20899 10053778 4
#clusters.tsv (-1 means no cluster)
id hash cluster
0 2471784231621897202 -1
1 16314724221857303546 -1
4 10666012509495373957 -1Analysis
False positives in long documents: Because SimHash is essentially a BOW algorithm that long documents are more likely ended up being similar to each other, it might be a good idea to combine Suffix Array Substring deduplication when filtering long duplicates:
-
Remove short duplicates or closely-similar documents – documents with small bit difference – based on SimHash (e.g
len <= 1024ordiff <= 3) - Remove substring duplicates based on Suffix Array
The time took for SimHash hashing and clustering is usually faster given the same data and computation resources than Suffix Array, e.g:
- It took a few hours to hashing and clustering English data with 96 cores and 1.4TB memory, while it took more than a day just to build the suffix array;
Suffix Array Results
Note: All offsets are byte offsets instead of string offsets.
Format
# text.csv (line based documents, each document has replaced the newline character with space)
document1
document2
# ids.csv (line based ids)
id0
id1
# sa.txt (byte offsets of duplicates from text.csv)
0 288
658 765
982 1246
1298 1586
# substring_bytes.tsv (restored line-based byte offsets)
# e.g. document 0 has two byte sequence duplicates: doc0[0:288] and doc0[658:765]
id x y
0 0 288
0 658 765
1 150 414
1 466 754Analysis
The threshold is applied both to the generation of sa.txt and the restoration of substring_bytes.tsv, meaning:
-
If a substring is shorter than \(50\) bytes in
text.csv, it will be ignored; - If a substring spans multiple documents with a length longer than \(50\), then during document boundary restoration aka. mapping the substring back to the each documents, any document-bound sub-substring that is shorter than \(50\) bytes will be further ignored;