SimHash deduplication
Progress
TABLE
Identifier AS "Identifier",
row["Physical Size"] AS "Physical Size",
row["Number of Rows"] AS "Size",
row["Unique URLs"] AS "Unique URLs",
row["SimHash Tokenization"] AS "SimHash Tokenization",
row["SimHash Parameters"] AS "SimHash Parameters",
row["SimHash Match Distribution"] AS "SimHash Distribution",
row["SimHash Results"] AS "SimHash Results"
FROM #deduplication AND #projectnotes
SORT Identifier
^d69b19
Goals
- SimHash deduplication ^c61205
  - Finish running deduplication by 2022-03-04
- Suffix Array Substring deduplication ^3fb13c
  - Finish running deduplication by 2022-03-04
- Deduplication report
  - Finish writing the report by 2022-03-05