English
- Identifier:: en
- Language:: English
- Physical Size:: 1TB
- Number of Rows:: 413161976
- Unique URLs:: 413027882
- SimHash Tokenization:: space-delimited 6-gram
- SimHash Parameters:: \((4,6)\)
- SimHash Match Distribution:: {4: 2028873, 3: 920150, 2: 330228, 1: 88703}
- SimHash Results:: 3367954 matches/228591 clusters/921522 hashes
- Substring Length Threshold:: \(100\)
- Total Text Size:: 982249623332
- Substring Duplicate Size:: 200379191637 (20.40%)
Space-delimited tokenization is used mainly for speed.
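
As a rough illustration of what SimHash over space-delimited 6-grams involves, the sketch below splits text on whitespace, forms 6-token shingles, and folds their hashes into a single fingerprint. It is only a minimal sketch under assumptions not stated in this card: the 64-bit fingerprint width, the BLAKE2b hash, and the function names `shingles`, `simhash`, and `hamming` are all hypothetical choices, and the meaning of the parameters \((4,6)\) is not interpreted here.

```python
import hashlib


def shingles(text, n=6):
    """Return space-delimited n-grams; n=6 follows the card, the rest is assumed."""
    tokens = text.split()  # whitespace split: no language-specific tokenizer needed
    if not tokens:
        return []
    if len(tokens) < n:
        # Assumption: texts shorter than n tokens form a single shingle.
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simhash(text, n=6, bits=64):
    """Compute a SimHash fingerprint (bits=64 and BLAKE2b are assumptions)."""
    counts = [0] * bits
    for gram in shingles(text, n):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            # Each shingle votes +1 or -1 on every bit position.
            counts[b] += 1 if (h >> b) & 1 else -1
    # A bit is set in the fingerprint when its votes are net positive.
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as near-duplicate candidates when `hamming(simhash(a), simhash(b))` falls at or below a chosen bit-distance threshold; the threshold itself is left to the caller here. Splitting on whitespace avoids running a full tokenizer over every row, which is the speed consideration noted above.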