nonconsumptive access to texts

Core Libraries

Python

The fullest library for creating nonconsumptive files is the Python module nonconsumptive.

Depending on what format your texts are in, it may be very easy to create a representation without coding in Python. If you have a few hundred texts located inside a folder called texts, and associated metadata in ‘meta.csv’, you can run the following commands to create a set of bookstacks.

pip install nonconsumptive
nonconsumptive build --texts texts --metadata meta.csv --metadata-id-field filename --targets unigrams bigrams stacks srp --dir nc

Once you have done so, host it online and add the package to our registry to allow others to work with it.

For more information, see the Python docs.
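The build command above expects a folder of texts plus a metadata file whose id column (passed as --metadata-id-field filename) identifies each text. As a minimal sketch of preparing that input, the following uses only the Python standard library to write a meta.csv with one row per text file. The helper name write_metadata, the title column, and the choice to use the file name without its extension as the id are all assumptions made for illustration; check the Python docs for the id format the library actually expects.

```python
import csv
from pathlib import Path


def write_metadata(texts_dir: str, out_csv: str) -> int:
    """Write one metadata row per .txt file and return the number of rows."""
    paths = sorted(Path(texts_dir).glob("*.txt"))
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["filename", "title"])
        writer.writeheader()
        for path in paths:
            # Assumption: the id is the file name without its extension.
            # The title is a placeholder; replace it with real catalog data.
            writer.writerow({"filename": path.stem, "title": path.stem})
    return len(paths)
```

Calling write_metadata("texts", "meta.csv") from the directory that holds the texts folder would produce a file matching the names used in the command above.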

R

Nonconsumptive access in R is handled through the Apache Arrow package (arrow); we recommend tidytext for exploring the data that it produces.

JavaScript

JavaScript interaction with nonconsumptive corpora happens through DuckDB. Because Parquet files are structured to allow random access, and duckdb-wasm uses HTTP requests to load only the parts of a file it needs, it is possible to treat a set of statically hosted bookstacks as a database that can be queried from the browser. This means that libraries can host bookstacks statically at low cost, and users can pull up only the subsets they want to look at.
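To make the random-access idea concrete, here is a toy sketch in Python. It is not Parquet and not DuckDB: it only imitates the general pattern in which a file keeps an index in its footer, so that a reader can fetch the footer first and then request just the byte ranges it needs. The format, the function names, and the RangeReader class (a stand-in for HTTP Range requests against a static host) are all hypothetical.

```python
import io
import json
import struct


def write_blob(chunks: dict[str, str]) -> bytes:
    """Pack chunks into one blob: data, then a JSON index, then its length."""
    body = io.BytesIO()
    index = {}
    for chunk_id, text in chunks.items():
        data = text.encode("utf-8")
        index[chunk_id] = (body.tell(), len(data))  # (offset, length)
        body.write(data)
    footer = json.dumps(index).encode("utf-8")
    # The final 4 bytes record the footer size so a reader can find the index.
    return body.getvalue() + footer + struct.pack("<I", len(footer))


class RangeReader:
    """Stand-in for HTTP Range requests; counts the bytes actually fetched."""

    def __init__(self, blob: bytes):
        self.blob = blob
        self.bytes_fetched = 0

    def fetch(self, start: int, length: int) -> bytes:
        self.bytes_fetched += length
        return self.blob[start:start + length]


def read_chunk(reader: RangeReader, total_size: int, chunk_id: str) -> str:
    """Fetch the footer, then only the byte range holding one chunk."""
    (footer_len,) = struct.unpack("<I", reader.fetch(total_size - 4, 4))
    footer = reader.fetch(total_size - 4 - footer_len, footer_len)
    offset, length = json.loads(footer)[chunk_id]
    return reader.fetch(offset, length).decode("utf-8")
```

Reading one small chunk from a large blob touches only the footer and that chunk's bytes, which is why a plain static file server can be enough to support selective queries.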

Underlying technology

  • blingfire for tokenization
  • Apache Arrow for data manipulation and processing
  • Apache Parquet for data storage

Elective Affinities

The underlying data architecture here is designed to work seamlessly with a variety of other tools.

  1. TensorFlow, through the Arrow datasets format
  2. tidytext in R
  3. Hugging Face