nonconsumptive access to texts

nc:ngrams

A struct entry that represents the counts of words in a text. This is an abstract type that should generally not be used as a column: instead, use [unigrams], [bigrams], [trigrams], etc.

The struct requires at least two columns: word1 (followed by word2, word3, etc. for longer n-grams) and count. If the tokenization includes part-of-speech tags, include them as pos1, pos2, etc.

count must be an integer. word1, word2, etc. should be either UTF-8 encoded strings or NULL. NULL values indicate the removal of a token, to preserve privacy, copyright, etc.

word1    count
the      32
&c       21
مرحبا    13
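As a rough illustration of the layout described above, the following sketch builds rows with word1, word2, etc. and an integer count from a token list, using only the Python standard library. The helper name ngram_counts is hypothetical and is not part of the nonconsumptive API. The sketch also assumes that Python's None stands in for NULL and that a removed token is counted like any other value.

```python
from collections import Counter
from typing import Optional


def ngram_counts(tokens: list[Optional[str]], n: int = 1) -> list[dict]:
    """Count n-grams and return rows shaped like the struct described above.

    Hypothetical helper for illustration only. Each row has keys
    word1..wordN plus an integer count. A None token stands in for NULL
    (a removed token) and is counted like any other value.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the token list.
    windows = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = Counter(windows)
    rows = []
    # most_common() sorts by descending count; ties keep first-seen order.
    for gram, count in counts.most_common():
        row = {f"word{k + 1}": word for k, word in enumerate(gram)}
        row["count"] = count
        rows.append(row)
    return rows


# One token has been replaced by None to mimic a removal.
tokens = ["the", "cat", "and", "the", "dog", None, "the", "cat"]
unigrams = ngram_counts(tokens, n=1)
bigrams = ngram_counts(tokens, n=2)
```

Here unigrams holds one row per distinct token with a single word1 key, while bigrams adds a word2 key to each row. A real column would more likely be stored in a typed columnar format rather than a list of dicts.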
  • complete: Does this represent an exact operation across the tokens? Ngram counts do not necessarily contain the same number of words as ‘tokenization.’ Some rows may be dropped to
  • added-noise: Have any additional words/tuples potentially been inserted into this column in the interests of differential privacy?
  • removals: Have any low.