English

#research #deduplication #projectnotes

  • Identifier::en
  • Language::English
  • Physical Size::1TB
  • Number of Rows::413161976
  • Unique URLs::413027882
  • SimHash Tokenization::space-delimited 6-gram
  • SimHash Parameters::\((4,6)\)
  • SimHash Match Distribution::{4: 2028873, 3: 920150, 2: 330228, 1: 88703}
  • SimHash Results::3367954 matches/228591 clusters/921522 hashes
  • Substring Length Threshold::\(100\)
  • Total Text Size::982249623332
  • Substring Duplicate Size::200379191637 (20.40%)
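
The figures above can be cross-checked with a few lines of arithmetic. The sketch below only re-derives numbers already listed; the variable names are illustrative, and reading "rows minus unique URLs" as the count of rows that share a URL with another row is an assumption.

```python
# Values copied from the attribute list above; variable names are illustrative.
number_of_rows = 413161976
unique_urls = 413027882
match_distribution = {4: 2028873, 3: 920150, 2: 330228, 1: 88703}
total_text_size = 982249623332
substring_duplicate_size = 200379191637

# The per-bucket match counts should add up to the reported match total.
total_matches = sum(match_distribution.values())
print(total_matches)  # 3367954

# Share of the total text flagged as duplicated substrings.
duplicate_share = substring_duplicate_size / total_text_size
print(f"{duplicate_share:.2%}")  # 20.40%

# Assumption: rows beyond the unique-URL count are rows with a repeated URL.
rows_with_repeated_url = number_of_rows - unique_urls
print(rows_with_repeated_url)  # 134094
```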

Space-delimited tokenization is used mainly for speed.
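
For reference, a minimal sketch of SimHash over space-delimited 6-grams. This is not the code used to produce the numbers above: the 64-bit fingerprint width, the BLAKE2b shingle hash, the unweighted voting, and the function names are all assumptions made for illustration, and the meaning of the \((4,6)\) parameters is not modeled here.

```python
import hashlib


def six_grams(text, n=6):
    """Split on whitespace and return overlapping n-token shingles."""
    tokens = text.split()
    if not tokens:
        return []
    if len(tokens) < n:
        # Assumption: short texts contribute a single shingle of all tokens.
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simhash(text, bits=64, n=6):
    """Compute an unweighted SimHash fingerprint (assumed 64 bits wide)."""
    counts = [0] * bits
    for gram in six_grams(text, n):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be compared with `hamming(simhash(a), simhash(b))`, with small distances indicating near-duplicates. Splitting on whitespace with `str.split()` avoids running a language-specific tokenizer, which is consistent with the speed motivation noted above.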
