Bengali

#research #deduplication #projectnotes

  • Identifier::bn
  • Language::Bengali
  • Physical Size::5GB
  • Number of Rows::841724
  • Unique URLs:: 841571
  • SimHash Tokenization::character 6-gram
  • SimHash Parameters::\((4,6)\)
  • SimHash Match Distribution::{4: 377, 3: 221, 2: 108, 1: 50}
  • SimHash Results:: 756 matches/273 clusters/638 hashes
  • Substring Length Threshold:: \(100\)
  • Total Text Size:: 5008061572
  • Substring Duplicate Size:: 1440245010 (28.76%)
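
A minimal sketch of how the SimHash figures above could be produced, for orientation only. It is not the pipeline that generated these numbers. Assumptions: 64-bit fingerprints, MD5 as the per-n-gram hash, and reading the first value of \((4,6)\) as the maximum Hamming distance (consistent with the match distribution's keys running from 1 to 4). The note does not say what the second value means, so the sketch does not use it. The names `char_ngrams`, `simhash`, and `hamming` are hypothetical.

```python
import hashlib
from collections import Counter

BITS = 64          # assumed fingerprint width
MAX_DISTANCE = 4   # assumed meaning of the first SimHash parameter


def char_ngrams(text, n=6):
    """Return overlapping character n-grams (the whole text if it is shorter than n)."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text, n=6):
    """64-bit SimHash over character n-grams, weighted by n-gram frequency."""
    weights = [0] * BITS
    for gram, count in Counter(char_ngrams(text, n)).items():
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for bit in range(BITS):
            weights[bit] += count if (h >> bit) & 1 else -count
    # A fingerprint bit is set when the weighted vote for it is positive.
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b):
    """True when the two fingerprints differ in at most MAX_DISTANCE bits."""
    return hamming(simhash(text_a), simhash(text_b)) <= MAX_DISTANCE


# Consistency checks on the figures recorded in this note.
distribution = {4: 377, 3: 221, 2: 108, 1: 50}
assert sum(distribution.values()) == 756                  # matches
assert round(100 * 1440245010 / 5008061572, 2) == 28.76   # substring duplicates
assert 841724 - 841571 == 153                             # rows sharing a URL
```

The checks at the end confirm that the match distribution sums to the 756 reported matches, that the substring duplicate share is 28.76% of the total text size, and that 153 rows share a URL with another row.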

Examples

[[Pasted image 20220227152418.png]][[Pasted image 20220227152638.png]][[Pasted image 20220227152550.png]]