Data stores – collections of data records

If you download raw.zip and unzip it, you will see it contains 1,035 files ending with a .fa filename suffix. (It also contains a tab delimited file and a log file, which we ignore for now.) The directory raw is a “data store” and the .fa files are “members” of it. In summary, a data store is a collection of members of the same “type”. This means we can apply the same application to every member.

How do I use a data store?

A data store is just a “container”. To open a data store you use the open_data_store() function. To load the data for a member of a data store you need a loader app that is appropriate for the member's data type.

Types of data store

Class Name               Supported Operations  Supported Data Types  Identifying Suffix
DataStoreDirectory       read, write, append   text                  None
ReadOnlyDataStoreZipped  read                  text                  .zip
DataStoreSqlite          read, write, append   text or bytes         .sqlitedb

Note

The ReadOnlyDataStoreZipped is just a compressed DataStoreDirectory.

The structure of data stores

If a directory was not created by cogent3 as a DataStoreDirectory then it has only the structure that existed previously.

If a data store was created by cogent3, either as a directory or as a sqlitedb, then it contains four types of data: completed records, not completed records, log files and md5 files. In a DataStoreDirectory, these are organised using the file system. The completed members are valid data records (as distinct from not completed) and are at the top level. The remaining types are in subdirectories.

demo_dstore
├── logs
├── md5
├── not_completed
└── ... <the completed members>

logs/ stores scitrack log files produced by cogent3.app writer apps. md5/ stores plain text files, each holding the md5 sum of a corresponding data member. These are used to check the integrity of the data store.
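To make the idea of an md5 integrity check concrete, here is a minimal standard-library sketch. It is not cogent3's implementation. The function names md5_hex and matches_recorded_md5 are hypothetical, and which md5 file pairs with which member is left to the caller because the naming scheme inside md5/ is not described here.

```python
import hashlib
from pathlib import Path


def md5_hex(path: Path) -> str:
    """Return the hex md5 digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def matches_recorded_md5(member: Path, md5_file: Path) -> bool:
    """Compare a member's md5 sum with the sum recorded in a plain text file.

    Assumption: md5_file holds only the hex digest, optionally followed by
    whitespace. The caller decides which md5 file belongs to which member.
    """
    return md5_hex(member) == md5_file.read_text().strip()
```

A member that was edited or truncated after it was written no longer matches its recorded sum, which is what makes the check useful.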

The DataStoreSqlite stores the same information, just in SQL tables.

Supported operations on a data store

All data store classes can be iterated over, indexed and checked for membership. Iteration and indexing return DataMember objects. In addition to providing access to members, the data store classes have convenience methods for describing their contents and providing summaries of log files that are included and of the NotCompleted members (see The NotCompleted object).

Opening a data store

Use the open_data_store() function. Use the mode argument to identify whether to open as read only (mode="r"), write (mode="w") or append (mode="a").

Opening a read only data store

We open the zipped directory described above, defining the filenames ending in .fa as the data store members. All files with that suffix within the directory become members of the data store (unless we use the limit argument).
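A minimal sketch of that call, assuming cogent3 is installed and raw.zip is in the working directory. The wrapper function open_fasta_store is hypothetical, and the keyword names suffix and limit are taken from open_data_store() as we understand it, so check them against your installed version.

```python
def open_fasta_store(path="raw.zip", limit=None):
    """Open a zipped data store read only, with the .fa files as members."""
    # Imported inside the function so that defining it does not need cogent3.
    from cogent3 import open_data_store

    # suffix selects which files are members; limit optionally caps how many.
    return open_data_store(path, suffix="fa", mode="r", limit=limit)


# Usage (requires cogent3 and raw.zip):
#   dstore = open_fasta_store()
#   print(dstore)
```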

Summarising the data store

The .describe property demonstrates that there are only completed members.

Data store “members”

Get one member

You can index a data store like any other Python sequence. Index 0, for example, returns the first member.

Looping over a data store

This gives you one member at a time.

Members can read their own data
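The three operations in this section (getting a member, looping, and reading) can be sketched together. The helper preview_members is hypothetical. It assumes only that a data store can be iterated and that each member has a read() method and a unique_id attribute, so it also works on any stand-in object with the same shape.

```python
from itertools import islice


def preview_members(dstore, n=3, width=20):
    """Return (unique_id, first `width` characters) for the first n members."""
    # islice relies only on iteration, one member at a time.
    return [(m.unique_id, m.read()[:width]) for m in islice(dstore, n)]


# Usage with an opened data store called dstore:
#   first = dstore[0]          # indexing gives a single member
#   preview_members(dstore)    # looping and reading
```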

Note

For a DataStoreSqlite member, the default data storage format is bytes. So reading the content of an individual record is best done using the load_db app.

Making a writeable data store

The creation of a writeable data store is specified with mode="w", or (to append) mode="a". In the former case, any existing records are overwritten. In the latter case, existing records are ignored.

DataStoreSqlite stores serialised data

When you specify a Sqlitedb data store as your output (by using open_data_store()) you write multiple records into a single file making distribution easier.

One important issue to note is that the process which creates a Sqlitedb “locks” the file. If that process exits abnormally (e.g. the run that was producing it was interrupted) then the file may remain in a locked state. If the db is in this state, cogent3 will not modify it unless you explicitly unlock it.

This is represented in the display as shown below.

To unlock, you execute the following:
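A minimal sketch, assuming the data store object has an unlock() method that accepts force=True, which is our understanding of DataStoreSqlite. The helper force_unlock and the file name in the usage comment are hypothetical.

```python
def force_unlock(dstore):
    """Unlock a locked sqlitedb data store and return its summary."""
    # force=True is meant for a lock left behind by a process that has gone.
    dstore.unlock(force=True)
    return dstore.describe


# Usage (requires cogent3; the file name is a placeholder):
#   from cogent3 import open_data_store
#   dstore = open_data_store("locked.sqlitedb")
#   force_unlock(dstore)
```

Only force an unlock when you are sure no running process is still writing to the file.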

Interrogating run logs

If you use the apply_to() method, a scitrack logfile will be stored in the data store. This includes useful information regarding the run conditions that produced the contents of the data store.

Log files can be accessed via a special attribute.

Each element in that list is a DataMember which you can use to get the data contents.
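As a sketch, assuming the attribute is named logs (our understanding of the cogent3 data store classes) and that each log member has a read() method. The helper log_excerpts is hypothetical.

```python
def log_excerpts(dstore, width=200):
    """Return the first `width` characters of each log in the data store."""
    # Each element of dstore.logs is a member, so it can read its own data.
    return [log.read()[:width] for log in dstore.logs]


# Usage with an opened data store called dstore:
#   for text in log_excerpts(dstore):
#       print(text)
```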

Pulling it all together

We will translate the DNA sequences in raw.zip into amino acid sequences and store them in a sqlite database. We will then interrogate the generated data store to get a synopsis of the results.

Defining the data stores for analysis

Loading our input data

Creating our output DataStoreSqlite

Create an app and apply it

We need apps to load the data, translate it and then to write the translated sequences out. We define those and compose into a single app.

We apply the app to all members of in_dstore. The results will be written to out_dstore.
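The whole workflow can be sketched as one function, assuming cogent3 is installed. The app names load_unaligned, translate_seqs and write_db and their arguments reflect our understanding of the cogent3 app collection. The function name and the default output file name translated.sqlitedb are hypothetical.

```python
def translate_to_sqlitedb(in_path="raw.zip", out_path="translated.sqlitedb"):
    """Translate the DNA members of in_path and write them to a sqlitedb."""
    # Imported inside the function so that defining it does not need cogent3.
    from cogent3 import get_app, open_data_store

    # Input: read only, members are the .fa files.
    in_dstore = open_data_store(in_path, suffix="fa", mode="r")
    # Output: mode="w" overwrites any existing records.
    out_dstore = open_data_store(out_path, mode="w")

    load = get_app("load_unaligned", format="fasta", moltype="dna")
    translate = get_app("translate_seqs")
    write = get_app("write_db", data_store=out_dstore)

    # Compose the three steps into a single app and run it on every member.
    app = load + translate + write
    return app.apply_to(in_dstore)


# Usage (requires cogent3 and raw.zip):
#   out_dstore = translate_to_sqlitedb()
```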

Inspecting the outcome

The .describe property gives us an analysis level summary.

We confirm the data store integrity

We can examine why some input data could not be processed by looking at the summary of the not completed records.
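The three inspection steps can be gathered in one place. This sketch assumes the data store exposes describe, validate() and summary_not_completed, which is our understanding of the cogent3 data store classes. The helper outcome_report is hypothetical.

```python
def outcome_report(dstore):
    """Collect the summary, integrity check and failure summary of a store."""
    return {
        "describe": dstore.describe,  # counts of each type of record
        "validate": dstore.validate(),  # integrity check of the data store
        "not_completed": dstore.summary_not_completed,  # why records failed
    }


# Usage with the output data store:
#   report = outcome_report(out_dstore)
#   report["not_completed"]
```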

We see they all came from the translate_seqs step. Some had a terminal stop codon while others had a length that was not divisible by 3.

Note

The .completed and .not_completed attributes give access to the different types of members while the .members attribute gives them all. For example, the number of completed members, len(out_dstore.completed), is the same as in the describe output, and each element is a DataMember.
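A small sketch of counting each type of member through those three attributes. The helper member_counts is hypothetical and assumes each attribute returns a sized collection.

```python
def member_counts(dstore):
    """Return the number of completed, not completed and all members."""
    return {
        "completed": len(dstore.completed),
        "not_completed": len(dstore.not_completed),
        "members": len(dstore.members),
    }


# Usage with the output data store:
#   member_counts(out_dstore)
```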