Arabic

#research #deduplication #projectnotes

  • Identifier::ar
  • Language::Arabic
  • Physical Size::28.2 GB
  • Number of Rows::8051306
  • Unique URLs::8048055
  • SimHash Tokenization::character 6-gram
  • SimHash Parameters::\((4,6)\)
  • SimHash Match Distribution::{4: 16583, 3: 7440, 2: 3110, 1: 954}
  • SimHash Results::28087 matches / 5748 clusters / 17759 hashes
  • Substring Length Threshold::\(100\)
  • Total Text Size:: 27445144911
  • Substring Duplicate Size:: 8130859489 (29.63%)
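
The SimHash fields above can be illustrated with a minimal sketch of character 6-gram fingerprinting. This is not the code that produced the numbers in this note: the 64-bit width, the use of MD5 as the per-gram hash, and the function names `char_ngrams`, `simhash`, and `hamming` are all assumptions made for illustration. The note does not spell out what the two values in \((4,6)\) mean, so the sketch does not encode them; it only shows how two fingerprints would be compared by bit difference.

```python
import hashlib
from collections import Counter


def char_ngrams(text, n=6):
    """Return overlapping character n-grams (the whole text if it is shorter than n)."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text, n=6, bits=64):
    """Compute a SimHash fingerprint over character n-grams.

    Assumption: each n-gram is hashed with MD5 and truncated to `bits` bits;
    the hash actually used for this dataset is not stated in the note.
    """
    weights = [0] * bits
    for gram, count in Counter(char_ngrams(text, n)).items():
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for b in range(bits):
            # Each n-gram votes +count or -count on every bit position.
            weights[b] += count if (h >> b) & 1 else -count
    # A bit is set in the fingerprint when its positive votes win.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "هذا نص تجريبي قصير لاختبار إزالة التكرار من مجموعة البيانات."
    doc_b = "هذا نص تجريبي قصير لاختبار إزالة التكرار من مجموعة البيانات!"
    print(hamming(simhash(doc_a), simhash(doc_b)))
```

Under this scheme, two documents would count as a match when the Hamming distance between their fingerprints falls at or below a chosen threshold. If the keys of the match distribution are such distances (an assumption), its four buckets are consistent with the reported total: 16583 + 7440 + 3110 + 954 = 28087.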

Examples

[[Pasted image 20220227100839.png]][[Pasted image 20220227100909.png]][[Pasted image 20220227101048.png]][[Pasted image 20220227101121.png]][[Pasted image 20220227101410.png]]