nc:ngrams

A struct entry that represents the counts of words in a text. This is an abstract type that should generally not be used as a column: instead, use [unigrams], [bigrams], [trigrams], etc.

The struct requires at least two columns: word1 (plus word2, word3, etc. for longer ngrams) and count. If the tokenization includes part-of-speech tags, include them as pos1, pos2, etc.

count must be an integer. word1, word2, etc. must each be either a UTF-8 encoded string or NULL. A NULL value indicates that a token was removed, for example to protect privacy or respect copyright.
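The type rules above can be checked mechanically. The following is a minimal sketch in Python, assuming each struct entry is represented as a plain dict and that NULL maps to None. The helper name validate_ngram_entry is hypothetical and not part of any nc library.

```python
import re

# Matches word1, word2, word3, ... as named in the description above.
_WORD_FIELD = re.compile(r"^word[1-9][0-9]*$")


def validate_ngram_entry(entry):
    """Return a list of problems found in one ngram struct entry (a dict).

    Hypothetical helper for illustration only. An empty list means the
    entry satisfies the rules stated above.
    """
    problems = []
    count = entry.get("count")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(count, int) or isinstance(count, bool):
        problems.append("count must be an integer")
    if "word1" not in entry:
        problems.append("word1 is required")
    for name, value in entry.items():
        # Only the wordN fields are checked; None stands in for NULL.
        if _WORD_FIELD.match(name):
            if value is not None and not isinstance(value, str):
                problems.append(f"{name} must be a string or None (NULL)")
    return problems


print(validate_ngram_entry({"word1": "the", "count": 32}))
print(validate_ngram_entry({"word1": None, "word2": "of", "count": "3"}))
```

Python strings are Unicode, so the UTF-8 requirement only comes into play when the entries are serialized; this sketch does not check it.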

word1	count
the	32
&c	21
مرحبا	13
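Rows shaped like the table above could be produced from a token list roughly as follows. This is a sketch, not a reference implementation: the function name unigram_rows is hypothetical, and grouping all removed tokens under a single NULL row is an assumption, since the text does not say how removed tokens are aggregated.

```python
from collections import Counter


def unigram_rows(tokens):
    """Count tokens and return rows with word1 and count fields.

    Hypothetical helper. A token of None stands for a removed word;
    this sketch assumes all removed tokens share one word1=None row.
    """
    counts = Counter(tokens)
    # most_common sorts by descending count; ties keep first-seen order.
    return [{"word1": word, "count": n} for word, n in counts.most_common()]


tokens = ["the", "cat", "the", None, "&c", "the", None]
for row in unigram_rows(tokens):
    # Tab-separated, matching the layout of the example table.
    print(row["word1"], row["count"], sep="\t")
```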

complete: Does this represent an exact operation across the tokens? Ngram counts do not necessarily contain the same number of words as 'tokenization'. Some rows may be dropped to
added-noise: Have any additional words/tuples potentially been inserted into this column in the interests of differential privacy?
removals: Have any low.
