Indonesian

#research #deduplication #projectnotes

  • Identifier:: id
  • Language:: Indonesian
  • Physical Size:: 19.3 GB
  • Number of Rows:: 11555544
  • Unique URLs:: 11553941
  • SimHash Tokenization:: character 6-gram
  • SimHash Parameters:: \((4,6)\)
  • SimHash Match Distribution:: {4: 1404992, 3: 390584, 2: 92815, 1: 17511}
  • SimHash Results:: 1905902 matches / 39272 clusters / 260718 hashes
  • Substring Length Threshold:: \(100\)
  • Total Text Size:: 17176417706
  • Substring Duplicate Size:: 4250966690 (24.75%)
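
The figures above can be cross-checked, and the tokenization illustrated, with a short Python sketch. It is not the pipeline that produced these numbers. The 64-bit fingerprint width, the use of MD5 as the per-gram hash, and the reading of the match-distribution keys as Hamming distances are all assumptions made for illustration; the function names are hypothetical.

```python
import hashlib


def char_ngrams(text, n=6):
    """Overlapping character n-grams; a short text yields itself as one gram."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text, n=6, bits=64):
    """SimHash fingerprint over character n-grams.

    Assumption: each gram is hashed with MD5 truncated to `bits` bits;
    the hash function actually used for this dataset is not recorded here.
    """
    counts = [0] * bits
    for gram in char_ngrams(text, n):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    # A bit is set when more grams voted 1 than 0 at that position.
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Consistency checks on the recorded statistics.
# Keys are assumed to be Hamming distances between matched fingerprints.
match_distribution = {4: 1404992, 3: 390584, 2: 92815, 1: 17511}
total_matches = sum(match_distribution.values())

duplicate_share = 4250966690 / 17176417706  # substring duplicates / total text
duplicate_urls = 11555544 - 11553941        # rows minus unique URLs

print(total_matches)                    # compare with the "matches" figure
print(round(duplicate_share * 100, 2))  # compare with the recorded percentage
print(duplicate_urls)                   # rows that share a URL with another row
```

Under these assumptions, two documents would count as a SimHash match when `hamming(simhash(a), simhash(b))` is at most 4, which is consistent with the largest key in the match distribution.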

Examples
