Hindi

#research #deduplication #projectnotes

  • Identifier:: hi
  • Language:: Hindi
  • Physical Size:: 10.2GB
  • Number of Rows:: 1982933
  • Unique URLs:: 1982279
  • SimHash Tokenization:: character 6-gram
  • SimHash Parameters:: \((4,6)\)
  • SimHash Match Distribution:: {4: 4735, 3: 1855, 2: 633, 1: 185}
  • SimHash Results:: 7408 matches/914 clusters/3191 hashes
  • Substring Length Threshold:: \(100\)
  • Total Text Size:: 10327894620
  • Substring Duplicate Size:: 3087503973 (29.89%)
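
For reference, a minimal sketch of SimHash over character 6-grams, the tokenization listed above. This is not the code that produced these numbers: the 64-bit width, the MD5-based token hash, and the helper names `char_ngrams`, `simhash`, and `hamming` are all assumptions made for illustration. It also assumes that the keys of the match distribution (1 to 4) are Hamming distances between fingerprints.

```python
import hashlib


def char_ngrams(text, n=6):
    """Return overlapping character n-grams (the whole text if it is shorter than n)."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text, n=6, bits=64):
    """Compute a `bits`-wide SimHash fingerprint over character n-grams.

    Assumption: each n-gram is hashed with MD5 and truncated to `bits` bits;
    the real pipeline may use a different hash function and width.
    """
    weights = [0] * bits
    for gram in char_ngrams(text, n):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical usage: two documents count as a match when the Hamming
# distance between their fingerprints is at most some threshold.
doc_a = "यह एक उदाहरण वाक्य है जो दो बार लगभग समान रूप में आता है।"
doc_b = "यह एक उदाहरण वाक्य है जो दो बार लगभग समान रूप में आता है!"
distance = hamming(simhash(doc_a), simhash(doc_b))
```

Character n-grams avoid the need for a Hindi word tokenizer, which is one plausible reason for choosing them here; the note itself does not state the motivation.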

Examples

[[Pasted image 20220227163219.png]][[Pasted image 20220227163305.png]]