nonconsumptive

access
to texts

nc:tokenization

A list of tokens, representing the full text of a document in order. Tokenization strategies may remove whitespace and non-word categories.

Columnar metadata.

  • nc:tokenization-software. string. The specific program used, with version number. e.g. ‘blingfire-1.6.2’.
  • nc:tokenization-strategy. string. The tokenization rules applied. Recommended terminology.
    • ‘regex’: A regular expression.
    • ‘words’: A linguistically informed effort to create.