A struct entry that represents the counts of words in a text. This is an abstract type that should generally not be used as a column: instead, use [unigrams], [bigrams], [trigrams], etc.
At least two columns are required in the struct: `word1` (and `word2`, `word3`, etc.) and `count`. If the tokenization includes part-of-speech tags, they should be included as `pos1`, `pos2`, etc.

`count` must be an integer. `word1`, `word2`, etc. should be either UTF-8 encoded strings or NULL. A NULL value indicates that a token was removed, for example to preserve privacy or to respect copyright.

word1 | count |
---|---|
the | 32 |
&c | 21 |
مرحبا | 13 |
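
The field rules above can be checked mechanically. The following is a minimal sketch in Python, not part of any official tooling: the function name `ngram_order` is hypothetical, entries are assumed to arrive as plain dictionaries, NULL is assumed to map to `None`, and the requirement that `word1`..`wordN` be numbered without gaps is an assumption read from the field list rather than a stated rule.

```python
import re
from typing import Any, Mapping

# Matches word1, word2, ... and captures the position number.
_WORD_FIELD = re.compile(r"word([1-9][0-9]*)")


def ngram_order(entry: Mapping[str, Any]) -> int:
    """Check one wordcount entry and return N, the number of wordN fields.

    Hypothetical helper: raises ValueError when a rule is violated.
    """
    count = entry.get("count")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(count, bool) or not isinstance(count, int):
        raise ValueError("count must be present and must be an integer")

    positions = set()
    for name, value in entry.items():
        match = _WORD_FIELD.fullmatch(name)
        if match is None:
            continue  # count, pos1, pos2, ... and other fields are not checked here
        # A Python str is already Unicode text; UTF-8 matters when it is stored.
        if value is not None and not isinstance(value, str):
            raise ValueError(f"{name} must be a string or None (NULL)")
        positions.add(int(match.group(1)))

    n = len(positions)
    # Assumption: fields are numbered word1..wordN with no gaps.
    if n == 0 or positions != set(range(1, n + 1)):
        raise ValueError("expected word1..wordN with no gaps")
    return n


# The rows of the example table, expressed as dictionaries.
rows = [
    {"word1": "the", "count": 32},
    {"word1": "&c", "count": 21},
    {"word1": "مرحبا", "count": 13},
]
orders = [ngram_order(row) for row in rows]
```

Each row of the example table passes the check as a one-word entry. An entry with a removed token, such as `{"word1": None, "word2": "cat", "count": 3}`, is also accepted, because `None` stands in for NULL.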