CPD3 Data Structure

CPD3 data is represented as a large number of potentially overlapping values. Each value contains information that uniquely identifies it to the system.

In CPD3, data, configuration, events and any other time-dependent information are all stored in a common form of "data". In most cases the final form of data or configuration is a single time-dependent value. How this value is constructed is somewhat complicated, but it is analogous to a stack of transparent sheets laid on top of one another. In this analogy, each sheet has only parts of it filled out, allowing those below it to be seen. That is, sheets higher in the stack "override" parts of those lower in the stack and/or provide definitions for parts that were completely "transparent" before. In the simple case of a traditional data time series, the stack is only one deep and contains a single non-transparent real number. For most forms of configuration the stack contains more levels.
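The transparent-sheet analogy can be sketched in a few lines of Python. This is only an illustration, not how CPD3 itself is implemented: it assumes each sheet is a nested dict, that a missing key stands for a "transparent" part, and the `overlay` helper and the example keys are made up for this sketch.

```python
def overlay(lower, upper):
    """Return `lower` with the filled-in parts of `upper` laid on top."""
    if not (isinstance(lower, dict) and isinstance(upper, dict)):
        # A plain (non-dict) value on the upper sheet hides whatever is below.
        return upper
    merged = dict(lower)
    for key, value in upper.items():
        if key in merged:
            # Both sheets define this part: descend and merge.
            merged[key] = overlay(merged[key], value)
        else:
            # Only the upper sheet defines this part: it fills the gap.
            merged[key] = value
    return merged


# Hypothetical sheets, listed from the bottom of the stack to the top.
sheets = [
    {"interval": 60, "output": {"format": "csv", "header": True}},
    {"output": {"format": "xml"}},
]

result = {}
for sheet in sheets:
    result = overlay(result, sheet)
print(result)
```

Here the upper sheet overrides only the output format; the interval and the header setting "show through" from the sheet below.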

Sequence Name

The first level of identification of a value is its total "name", which is made up of several components. Data within a single name are viewed as representing one property. In the context of traditional data, this means a single parameter in time. For configuration, this is usually a single subsystem or component being configured. The components of the name are as follows.

Station

The station is generally the three letter GAW station code for the physical site in question. Stations are not case sensitive.

An additional station called the "default station" is also present. The default station is assigned the code _ (a single underscore) and is generally present in all data. It is most commonly used to provide configuration defaults and can generally be ignored when querying data.

Archive

The archive is the "type" of data being accessed. For example, the raw archive generally holds data as they come out of the acquisition system with no mentor QC applied. Once those data have been passed by the mentor and had corrections applied, they enter the clean archive. Passed data averaged to one hour intervals are placed in the avgh archive. There are additional special archives, including configuration and events, for the system configuration and event logs, respectively. Archives are not case sensitive.

Archives can also have an _meta suffix that contains metadata about the data or configuration in the "main" archive. For example the raw_meta archive contains the metadata about the raw archive. That metadata specifies things like physical units and output formats.

Variable

The variable is what is normally considered as an individual parameter in a time series or a column in a table. For example a single wavelength of scattering would be a single variable. In this context the green scattering from the "S11" nephelometer would be BsG_S11.

In archives that are not directly tied to measurements (e.g. the configuration archive) the variable specifies the next level of the hierarchy being addressed. For example, processing specifies the configuration for parts of the automatic processing system.

Variables are case sensitive.

Flavors

The flavors of a parameter are its usually hidden qualifiers. Parameters may have zero or more flavors applied to them. For example, the pm1 flavor specifies that the value is size selected to less than 1μm diameter. Flavors are not case sensitive.

Some common flavors include:

PM1: Data have been size selected to less than 1μm diameter.
PM25: Data have been size selected to less than 2.5μm diameter.
PM10: Data have been size selected to less than 10μm diameter.
stats: This value represents statistics about the average rather than the averaged value itself. For example, this would contain quantiles within the averaging period.
cover: This value contains the coverage fraction (zero to one) of data within the averaging period. For example an hourly average with 45 minutes of valid data would have a coverage of 0.75. When absent, this is assumed to be one (whole time period present). This is normally used to calculate correctly weighted averages to prevent single points from throwing the average off.
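
To illustrate why the coverage fraction matters, the sketch below combines several averages into one, weighting each by its coverage. It assumes all periods have the same length, treats a missing coverage as one (as described above), and uses a hypothetical `weighted_mean` helper; it is not CPD3's own averaging code.

```python
def weighted_mean(points):
    """Combine (value, coverage) pairs from equal-length periods.

    A coverage of None stands for an absent cover value and counts as 1.0.
    """
    total = 0.0
    weight = 0.0
    for value, coverage in points:
        w = 1.0 if coverage is None else coverage
        total += value * w
        weight += w
    if weight == 0.0:
        return None  # no valid data at all
    return total / weight


# Three hourly averages; the last hour holds only 6 minutes of data.
hours = [(10.0, None), (12.0, 0.75), (50.0, 0.1)]
print(weighted_mean(hours))
```

An unweighted mean of these three values would be 24, dominated by the sparsely covered last hour; weighting by coverage keeps the result close to the two well-covered hours.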

Priority

The next component of data values is the concept of priority. The priority is an integer (positive or negative) that defines where in the stack the value should appear. Lower priorities are overridden by higher ones. That is, higher priorities are closer to the "top" of the transparency stack analogy above.

Time Bounds

The final component is the time range each value occupies. That is, every value for both data and configuration in the system has a start and end time. Either of these can be "undefined", indicating that the value extends to infinity in that direction. A value with an undefined start is considered active for all time up until its end, and a value with an undefined end is considered active from its start onward. If a value has both an undefined start and end time, it is always in effect. In general the stack of values can be thought of as being evaluated at every point in time. That is, layers of the transparency can extend between multiple boundaries and do not necessarily need to align on all levels. This allows parts of the configuration to be defined by different levels corresponding to the times they affect.
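
How time bounds and priority interact can be sketched as follows. The sketch assumes times are plain numbers, that `None` stands for an undefined bound, and that a value is active from its start up to but not including its end (this half-open convention is an assumption; the text above does not specify it). The `Value` class and the `is_active` and `active_stack` functions are illustrative names only.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Value:
    start: Optional[float]  # None: undefined, extends infinitely into the past
    end: Optional[float]    # None: undefined, extends infinitely into the future
    priority: int
    payload: Any


def is_active(value, t):
    """True if `value` is in effect at time `t` (start inclusive, end exclusive)."""
    after_start = value.start is None or value.start <= t
    before_end = value.end is None or t < value.end
    return after_start and before_end


def active_stack(values, t):
    """Values in effect at `t`, ordered from the bottom of the stack to the top."""
    return sorted((v for v in values if is_active(v, t)), key=lambda v: v.priority)


values = [
    Value(None, None, 0, "always in effect"),
    Value(100.0, 200.0, 5, "override for 100..200"),
    Value(150.0, None, -1, "low priority from 150 onward"),
]

for t in (50.0, 120.0, 180.0, 250.0):
    print(t, [v.payload for v in active_stack(values, t)])
```

Evaluating the stack at different times gives different sets of layers, which is why layer boundaries do not need to line up with one another.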

Additional Sequences

In general, most configuration also contains values in the default station mentioned above, which is designated by the station name _ (a single underscore). However, the default station is not exclusive to configuration. Its values form another stack of data placed under the "main" ones in question. That is, values in the default station are always below those from normal stations, but they obey priority sorting among themselves. The values in the default station are used to provide system defaults and templates for more specialized configurations. This allows all stations to share common configuration when possible.
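
The ordering described here can be sketched with a two-part sort key. The tuple layout, the `stack_order` name and the station code `bnd` are assumptions made for this example only.

```python
DEFAULT_STATION = "_"


def stack_order(values):
    """Order (station, priority, payload) tuples from bottom to top.

    Default-station values sort below every normal-station value;
    priority decides the order within each group.
    """
    return sorted(values, key=lambda v: (v[0] != DEFAULT_STATION, v[1]))


values = [
    ("bnd", -10, "station, low priority"),
    ("_", 100, "default, high priority"),
    ("bnd", 0, "station, normal priority"),
    ("_", 0, "default, normal priority"),
]

for station, priority, payload in stack_order(values):
    print(station, priority, payload)
```

Note that the default-station value with priority 100 still ends up below the station value with priority -10: the station grouping is applied before priority.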

Finally, there are values attached to most names in a "metadata" archive. Metadata values are designated by the same name as their corresponding main data values, except that the archive ends with _meta. These values do not generally affect the actual values as seen by the system. Instead, they provide additional information about the final values. For traditional data, this can be things like formats, MVCs, wavelengths and descriptions. In the context of configuration, they usually provide information about the possible elements of the configuration or limits on what values are acceptable.

Summary

In summary, data and configuration in CPD3 consist of zero or more layers of individual values. Each value is uniquely defined by several components:

The time range: The start and end time the value is effective for. Outside of this range, the system will not "see" the value at all.
The name: The station, archive, variable and flavors (if any) that determines what the value is actually about.
The priority: An integer specifying the depth in the stack at which the value is placed when the final data are calculated.

That is, there can only be a single value for a unique combination of the components. For example, to create multiple levels of configuration active for the same time range they must be given unique priorities.
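
The uniqueness rule can be sketched with a dict keyed by the identifying components. Several details here are assumptions made for illustration: case-insensitive components are normalized to lower case, flavors are treated as an unordered set, times are plain numbers, and storing a value under an existing identity replaces the old one. The `identity` and `put` helpers and the station code `bnd` are hypothetical.

```python
def identity(station, archive, variable, flavors, start, end, priority):
    """Build a hashable key from the components that identify a value.

    Station, archive and flavors are not case sensitive, so they are
    lower-cased here; the variable is case sensitive and kept as-is.
    """
    name = (
        station.lower(),
        archive.lower(),
        variable,
        frozenset(f.lower() for f in flavors),
    )
    return (name, start, end, priority)


store = {}


def put(key, payload):
    """Store a value; an existing value with the same identity is replaced."""
    store[key] = payload


put(identity("BND", "raw", "BsG_S11", ["PM1"], 0, 3600, 0), 12.5)
put(identity("bnd", "RAW", "BsG_S11", ["pm1"], 0, 3600, 0), 13.0)  # same identity
put(identity("bnd", "raw", "BsG_S11", ["pm1"], 0, 3600, 1), 14.0)  # new priority
put(identity("bnd", "raw", "bsg_s11", ["pm1"], 0, 3600, 0), 15.0)  # different variable
print(len(store))
```

The first two calls differ only in letter case for the station, archive and flavor, so they address the same value; changing the priority or the case of the variable produces a distinct value.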