The purpose of this project is to define standards for sharing collections of texts nonconsumptively, in a way that meets multiple goals at once.
The basic element of a nonconsumptive corpus is a collection we call a bookstack, composed of files called shelves. Each shelf is a single file representing many different texts; a collection is any set of stacks using an identical schema. Strictly defining schemas allows a variety of tools to work with a nonconsumptive format.
These are terms from libraries; scholarship around the turn of the 20th century was greatly facilitated by the development of steel bookstacks to form the foundation of new libraries. Bookstacks were standardized, uniform, and built to last.
There are two other terms that could be used for these entities, and we deliberately avoid both.
“Corpus,” commonly used for collections in corpus linguistics, often implies a coherence or thoroughness to a collection. While linguistic corpora can be put into bookstacks, many textual collections don’t make good objects for corpus linguistics, and we want to avoid implying that they do.
A “library” is a pretty good name for a collection of texts, and a library can be made up of a bunch of bookstacks. But most good libraries are also heterogeneous in a way that the bookstacks described here aren’t. And in many programming languages the word “library” has its own set of meanings, which would get confusing.
Central to the data strategy here are new columnar data warehousing formats.
It is possible in principle, though not currently supported, to distribute bookstacks in many different formats. But the ecosystem here relies especially on Apache Arrow and Apache Parquet because using them immensely simplifies access for both providers and consumers of texts.
Format | Access speed | Compression | Allows random access? | Allows metadata? | Supports schema? |
---|---|---|---|---|---|
Unicode CSV | - | - | No | Yes | No |
Newline-delimited JSON | - | - - | No | Yes | Yes |
Unzipped directory | + | - | Yes | Yes | Yes |
Zipped folder | - - | + | Yes | Yes | Yes |
XML (TEI, etc) | - - | - - | No | Yes | Yes |
Apache Feather | + + + | + + | Yes | Yes | Yes |
Apache Parquet | + + | + + + | Yes | Yes | Yes |
This portion of the website defines a schema of types for representation in a nonconsumptive corpus. The goal is to steer it into full compliance with a linked open data representation, so that textual corpora can be distributed as JSON-LD files. But for the sake of researchers and web developers, it also privileges a data transfer format based on the Parquet and Feather formats, both sponsored by the Apache Software Foundation, to allow fast computation and strict data typing.
A bookstack can have as many, or as few, of these items as it wants.