Urdu

#research #deduplication #projectnotes

  • Identifier:: ur
  • Language:: Urdu
  • Physical Size:: 1.6GB
  • Number of Rows:: 371691
  • Unique URLs:: 371628
  • SimHash Tokenization:: character 6-gram
  • SimHash Parameters:: \((4,6)\)
  • SimHash Match Distribution:: {4: 135, 3: 96, 2: 35, 1: 19}
  • SimHash Results:: 285 matches/167 clusters/407 hashes
  • Substring Length Threshold:: \(100\)
  • Total Text Size:: 1565304005
  • Substring Duplicate Size:: 231043706 (14.76%)
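
Sketch of SimHash over character 6-grams, to make the tokenization entry above concrete. This is not the code that produced the numbers on this page. The 64-bit fingerprint width, the MD5-based feature hash, and the unweighted voting are all assumptions, and the helper names (`char_ngrams`, `simhash`, `hamming_distance`) are made up for illustration. Reading the keys of the match distribution as Hamming distances between fingerprints is also an assumption; the note does not say how \((4,6)\) is interpreted.

```python
import hashlib


def char_ngrams(text, n=6):
    """Return all overlapping character n-grams of text (n=6 as in the note)."""
    if len(text) < n:
        # Assumption: a text shorter than n contributes itself as one feature.
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text, n=6, bits=64):
    """Compute an unweighted SimHash fingerprint (bits=64 is an assumption)."""
    votes = [0] * bits
    for gram in char_ngrams(text, n):
        # Assumption: MD5 is used only as a convenient stable feature hash.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "یہ ایک مثال کا جملہ ہے جو دو بار آتا ہے۔"
    doc_b = "یہ ایک مثال کا جملہ ہے جو دو بار آتا ہے!"
    print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

Under this reading, two rows would count as a match when their fingerprints differ in at most a few bits, and near-identical texts such as `doc_a` and `doc_b` should usually land closer together than unrelated ones.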

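Quick consistency check of the figures above, as a sketch: it only redoes the arithmetic on the numbers copied from this page (the variable names are mine). It confirms that the match distribution adds up to the reported number of matches and that the duplicate share rounds to the stated percentage.

```python
# Figures copied from the attribute list on this page.
rows = 371_691
unique_urls = 371_628
match_distribution = {4: 135, 3: 96, 2: 35, 1: 19}
total_text_size = 1_565_304_005
substring_duplicate_size = 231_043_706

# Sum of the per-key counts in the SimHash match distribution.
total_matches = sum(match_distribution.values())

# Rows whose URL already appears on another row.
repeated_url_rows = rows - unique_urls

# Share of the total text size flagged as duplicated substrings.
duplicate_share = 100 * substring_duplicate_size / total_text_size

print(total_matches)              # 285
print(repeated_url_rows)          # 63
print(f"{duplicate_share:.2f}%")  # 14.76%
```

So the distribution is consistent with the 285 matches reported, and only 63 rows share a URL with another row.
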
Examples

[[Pasted image 20220227194607.png]]