to texts

The purpose of this project is to define standards for sharing collections of texts nonconsumptively that meet multiple goals at once.

  1. Simple but extensible metadata standards for sharing digital texts, and for characterizing what kinds of nonconsumptive and consumptive information are being distributed.
  2. Extremely fast access in Python, R, and the browser so that researchers will actually use the formats.
  3. Long-term, open, stable data formats for distribution and archiving so that stewards of digital texts can share data secure in the knowledge that they will remain readable for years to come.
  4. Well-compressed representations so that they can be uploaded and downloaded over a network.

The Bookstack plan

What we do

The basic element of a nonconsumptive corpus is a collection we call a bookstack, composed of files called shelves. Each shelf is a single file representing many different texts; a collection is any set of stacks using an identical schema. Strictly defining schemas allows a variety of tools to work with a nonconsumptive format.

These are terms from libraries; scholarship around the turn of the 20th century was greatly facilitated by the development of steel bookstacks to form the foundation of new libraries. Bookstacks were standardized, uniform, and built to last.

There are two terms that you could use for these entities that we eschew.

“Corpus,” commonly used for collections in corpus linguistics, often implies a coherence or thoroughness to a collection. While linguistic corpora can be put into bookstacks, many textual collections don’t make good objects for corpus linguistics, and we want to avoid that implication.

A “Library” is a pretty good name for a collection of texts; and a library can be made up of a bunch of bookstacks. But most good libraries are also heterogeneous in a way that the bookstacks described here aren’t. And in many programming languages the word “library” has its own set of meanings that would get confusing.

File Formats

Central to the data strategy here are new columnar data warehousing formats.

It is possible, though not currently supported, to distribute bookstacks in many different formats. But the ecosystem here relies especially on Apache Arrow and Apache Parquet because using them immensely simplifies access for both providers and consumers of texts.

Format                   Access speed / Compression   Random access?   Metadata?   Schema?
Unicode CSV              - -                          No               Yes         No
Newline-delimited JSON   - - -                        No               Yes         Yes
Unzipped directory       + -                          Yes              Yes         Yes
Zipped folder            - - +                        Yes              Yes         Yes
XML (TEI, etc.)          - - - -                      No               Yes         Yes
Apache Feather           + + + + +                    Yes              Yes         Yes
Apache Parquet           + + + + +                    Yes              Yes         Yes


This portion of the website defines a schema of types for representation in a nonconsumptive corpus. The goal is to steer it into full compliance with a linked open data representation, so that textual corpora can be distributed as JSON-LD files. But for the sake of researchers and web developers, it also privileges a data transfer format based on the Parquet and Feather formats, sponsored by the Apache Software Foundation, to allow fast computation and strict data typing.

A bookstack can have as many, or as few, of these items as it wants.