Indonesian
- Identifier:: id
- Language:: Indonesian
- Physical Size:: 19.3GB
- Number of Rows:: 11555544
- Unique URLs:: 11553941
- SimHash Tokenization:: character 6-gram
- SimHash Parameters:: \((4,6)\)
- SimHash Match Distribution:: {4: 1404992, 3: 390584, 2: 92815, 1: 17511}
- SimHash Results:: 1905902 matches/39272 clusters/260718 hashes
- Substring Length Threshold:: \(100\)
- Total Text Size:: 17176417706
- Substring Duplicate Size:: 4250966690 (24.75%)
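The figures above can be cross-checked against each other. The following is a minimal sketch, not part of the original pipeline: it only re-derives the total match count from the SimHash match distribution and the duplicate percentage from the two size fields. The variable names are illustrative, and the meaning of the distribution keys (1 to 4) is not interpreted here.

```python
# Values copied verbatim from the statistics listed above.
match_distribution = {4: 1404992, 3: 390584, 2: 92815, 1: 17511}
reported_matches = 1905902
total_text_size = 17176417706
substring_duplicate_size = 4250966690

# The per-key match counts should add up to the reported number of matches.
total_matches = sum(match_distribution.values())

# Share of the total text flagged as substring duplicates, in percent.
duplicate_share = 100 * substring_duplicate_size / total_text_size

# Share of matches per key, in percent (hypothetical helper view of the same data).
match_shares = {
    key: 100 * count / total_matches
    for key, count in match_distribution.items()
}

if __name__ == "__main__":
    print(f"sum of distribution: {total_matches} (reported: {reported_matches})")
    print(f"substring duplicate share: {duplicate_share:.2f}%")
    for key in sorted(match_shares, reverse=True):
        print(f"key {key}: {match_shares[key]:.2f}% of matches")
```

Under these inputs the distribution sums to the reported match count, and the duplicate share rounds to the stated 24.75%.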
Examples