Datasets within the SSSC environment consist of all contents of a Unix directory. A Dataset is the atomic unit of data that is stored, archived, imported, exported, etc. by the SSSC DSDS. Datasets may contain MDI data, calibration data, processing control tables, users' backup files, etc. In short, anything that can be put into a directory and given to the DSDS for storage is a Dataset.
In order to provide a more specific set of rules to enable standard processing programs, we have defined the class of Datasets known as Conforming Datasets. Conforming Datasets are SSSC Datasets which meet certain rules governing their contents.
A Conforming Dataset must have a file known as an Overview File which describes the class of rules followed in constructing the Dataset. The Overview File is stored as a FITS file containing only a FITS header. The Overview File must contain several specific FITS keywords which describe the data storage format, logical organization, kind of data, file naming specifics, etc. The Overview File may also contain keyword attributes that contain information that is global to the dataset, i.e. common to all data records that may be in the dataset.
A Conforming Dataset is conceptually organized as a collection of data variables organized in a set of data records. Keyword attribute information may be present for each data record for each data variable. Additional information pertaining to the creation of the Dataset may also be present in the form of additional files. Thus a Conforming Dataset consists of at least four types of information: the Overview File; per-record attribute information; data; and other information. The "other" information may be generated by the programs that make the Dataset and will not be further described here.
Conforming Dataset file names for the Overview, Record, and Data files are built according to strict rules. They are based on the "Series Name" and "Series Number" parts of an SSSC Dataset Name. Dataset Names are described in the SSSC Dataset Naming document. The Overview file contains the rules for naming the Record and Data files. The Overview file is named:
{Basename}.overview.fits
where {Basename} is formed by a rule using {Series_Name} and {Series_Number}. {Series_Name} and {Series_Number} are from the Dataset Name. Basename may be empty. The normal and preferred form for Basename is:
{Series_Name}.{Series_Number}
Thus, it is possible to deduce the file names for a Conforming Dataset without knowing the Dataset Name. When the preferred form for Basename is used, it is also possible to export multiple Conforming Datasets to the same non-DSDS directory for analysis outside the SSSC Pipeline Processing environment. (Note that there are no file name constraints on the "other" files that may be present, so they may be lost if multiple datasets are exported to the same directory.)
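The naming rule above can be sketched in a few lines of Python. This is only an illustration of the preferred Basename form: the function names and the example series name and number are hypothetical, and how an empty Basename should be rendered is not specified here, so the sketch simply substitutes it literally.

```python
def preferred_basename(series_name, series_number):
    """Preferred Basename form: {Series_Name}.{Series_Number}."""
    return f"{series_name}.{series_number}"


def overview_filename(basename):
    """Overview file name: {Basename}.overview.fits.

    Assumption: an empty Basename is substituted literally; the text does
    not say whether the leading dot is kept in that case.
    """
    return f"{basename}.overview.fits"


# Hypothetical series name and number, used only for illustration.
name = overview_filename(preferred_basename("example_series", "000012"))
```

Because every preferred-form name starts with the series name and number, files from different Datasets exported to one directory do not collide.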
The Overview File is stored as a SIMPLE FITS file with a FITS header but no data section. The header section contains a collection of keyword-value pairs. The Overview File must contain certain keywords and may contain additional information. All of the keywords in the Overview File are considered to apply to the entire Dataset and may therefore be used to store global attributes that would otherwise be redundantly stored in each record of the per-record attribute information. The Required keywords for Overview files can be found here.
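Since the Overview File is a header-only FITS file, its keyword-value pairs can be read as a sequence of 80-character cards ending with an END card. The following is a minimal sketch using only the Python standard library; it is not a full implementation of the FITS standard (no CONTINUE cards, no type conversion), and a FITS library would be the usual choice in practice. The helper names and the PROTOCOL value shown are placeholders, not actual SSSC values.

```python
def make_card(keyword, value):
    """Build one 80-character FITS header card (helper for the demo only)."""
    return f"{keyword:<8}= {value}".ljust(80)


def parse_header(raw):
    """Return {keyword: value-as-string} from a raw FITS header string."""
    header = {}
    for start in range(0, len(raw), 80):
        card = raw[start:start + 80]
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # COMMENT, HISTORY, and blank cards carry no value
        rest = card[10:].strip()
        if rest.startswith("'"):
            # Quoted string: find the closing quote, treating '' as an
            # escaped single quote inside the string.
            end = 1
            while True:
                end = rest.index("'", end)
                if rest[end:end + 2] == "''":
                    end += 2
                    continue
                break
            value = rest[1:end].replace("''", "'").rstrip()
        else:
            # Unquoted value: everything before an optional "/ comment".
            value = rest.split("/", 1)[0].strip()
        header[keyword] = value
    return header


# Demo header built in memory; the keyword values are placeholders.
raw = "".join([
    make_card("SIMPLE", "T"),
    make_card("BITPIX", "8"),
    make_card("NAXIS", "0"),
    make_card("PROTOCOL", "'xxx.FITS'  / placeholder value"),
    "END".ljust(80),
])
overview = parse_header(raw)
```

For a file on disk, the header text could be obtained with `open(path, "rb").read().decode("ascii")` before calling `parse_header`.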
The Overview file must also contain the name, version, and invocation parameters of the program(s) that created the Dataset. Generally only the most recent program will be represented, with earlier stages of processing available by following the chain back through the input Datasets.
Some of the keywords found in Overview files are also available in the DSDS databases and can be the basis of queries to locate datasets of interest. These include: CONFORMS, SER_TYP, T_EPOCH, T_BLOCK, EXT_INFO (T_STEP), T_FIRST, T_LAST. These DSDS database entries are also used for automatic identification of Datasets required for certain standard pipeline processing programs. The T_FIRST and T_LAST keywords are automatically read from the Overview file and updated in the DSDS database when a Dataset is made in the SSSC PE Pipeline Processing environment. The other Overview File keywords that are in the database are entered in the "ds_naming" and "epoch" tables when the Dataset is first created.
The Record Info part of the Conforming Dataset is usually stored as an ASCII table using the /rdb storage protocols. These Datasets will have one RDB file per data variable. The name of the file will be:
{Series_Name}.{Series_Number}.record.rdb
or
{Series_Name}.{Series_Number}.{Variable_Name}.record.rdb
The first form is usually used in uni-variate datasets, where the default Variable_Name is often empty.
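As an illustration, both forms of the record file name can be produced by one small function that drops the variable part when Variable_Name is empty. The function name and the example values are hypothetical.

```python
def record_filename(series_name, series_number, variable_name=""):
    """Record info file name; the variable part is omitted when empty."""
    parts = [series_name, str(series_number)]
    if variable_name:
        parts.append(variable_name)
    parts.append("record.rdb")
    return ".".join(parts)


# Hypothetical names, for illustration only.
univariate = record_filename("example_series", "000012")
multivariate = record_filename("example_series", "000012", "velocity")
```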
/rdb format files contain simple ASCII text lines which constitute a table consisting of columns called fields and rows called records. They contain a pair of header lines which contain field names and field width markers. Columns are separated by TAB characters. A sample record information file is available here.
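A table laid out this way can be read with a short parser. The sketch below assumes only what is described above (a field-name line, a field-width line, TAB-separated columns), plus one further assumption: that any leading lines starting with "#" are comments and can be skipped. The sample rows, width markers, and time strings are made up for illustration.

```python
def read_rdb(lines):
    """Parse /rdb-style text lines into (names, widths, records)."""
    rows = [ln.rstrip("\r\n") for ln in lines]
    # Assumption: leading '#' lines are comments, not part of the table.
    while rows and rows[0].startswith("#"):
        rows.pop(0)
    if len(rows) < 2:
        raise ValueError("expected a field-name line and a field-width line")
    names = rows[0].split("\t")
    widths = rows[1].split("\t")
    records = []
    for ln in rows[2:]:
        if not ln:
            continue  # ignore blank lines, e.g. at end of file
        records.append(dict(zip(names, ln.split("\t"))))
    return names, widths, records


# Made-up sample content; real files will hold different values.
sample = (
    "I_DREC\tDATAFILE\tT_REC\tT_OBS\n"
    "4N\t40S\t20S\t20S\n"
    "0\texample_series.000012.0000.fits\ttime_a\ttime_b\n"
    "1\texample_series.000012.0001.fits\ttime_c\ttime_d\n"
)
names, widths, records = read_rdb(sample.splitlines())
```

Each record is returned as a dictionary keyed by field name, with all values left as strings.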
The particular set of keyword-value attributes included for a particular Dataset will depend on the type of data stored in the Dataset. The DATAKIND keyword in the Overview file usually will indicate a type of dataset and imply a particular list of keywords that can be expected. Some of the keywords expected for a particular DATAKIND are global in nature, applying to all of the data records in the Dataset. These may be stored in the Overview file so that only one instance need be present. Thus the program that uses the Dataset must look in both the Overview file and the Record info file for any particular keyword. If a keyword appears in both places, the one in the Record file must be used. (Note: if the Data records are stored as FITS files, some per-record attribute information may also be present in the data record FITS headers. The values found in the Record and Overview files override values found in the Data FITS files. If the Dataset was prepared for export, the keywords found in the Data record FITS files will be copies of those in the Record info files.)
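The lookup order described above (Record info first, then the Overview file, then any Data FITS header) can be sketched as follows. The function name is hypothetical, each source is represented as a plain dictionary, and the sketch assumes a keyword counts as present whenever its name appears in a source.

```python
def lookup_keyword(name, record_row, overview, data_header=None):
    """Return the value of a keyword using the stated precedence.

    Order: per-record info, then Overview file, then Data FITS header.
    """
    for source in (record_row, overview, data_header or {}):
        if name in source:
            return source[name]
    raise KeyError(name)


# Made-up values, for illustration only.
record_row = {"T_OBS": "record_value"}
overview = {"T_OBS": "overview_value", "DATAKIND": "overview_kind"}
data_header = {"DATAKIND": "fits_kind", "BITPIX": "16"}

t_obs = lookup_keyword("T_OBS", record_row, overview, data_header)
kind = lookup_keyword("DATAKIND", record_row, overview, data_header)
bitpix = lookup_keyword("BITPIX", record_row, overview, data_header)
```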
Datasets made using the "vds_close" function in the SSSC "Virtual DataSet" function library will have the keywords I_DREC, DATAFILE, T_REC, and T_OBS in the first four columns. The lists of keywords present for particular datasets can be found by examining the Dataset Keyword Lists.
For example, for level-1.5 Doppler data, the keywords expected are here.
The actual data for the Dataset is stored in one or more files, depending on the value of the second field in the PROTOCOL keyword value found in the Overview file. The name(s) of the files containing the data can be deduced by a rule found in the Overview file. The file name depends on the PROTOCOL and is:
PROTOCOL=xxx.FITS
{Series_Name}.{Series_Number}.{Variable_Name}.{record_number}.fits
or optionally for univariate data:
{Series_Name}.{Series_Number}.{record_number}.fits
PROTOCOL=xxx.FITS_MERGE
{Series_Name}.{Series_Number}.fits
PROTOCOL=xxx.CDF
{Series_Name}.{Series_Number}.cdf
The filenames are also stored in the Record info files (or Overview file) under the keyword DATAFILE.
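The per-protocol naming rules listed above can be summarized in one function. This is a sketch only: the function name is hypothetical, the record number is passed in already formatted because its exact format is not given here, and the DATAFILE keyword remains the recorded source of the actual names.

```python
def data_filename(protocol, series_name, series_number,
                  record_number=None, variable_name=""):
    """Deduce a data file name from the second field of PROTOCOL."""
    fields = protocol.split(".")
    if len(fields) < 2:
        raise ValueError(f"PROTOCOL has no second field: {protocol!r}")
    storage = fields[1]
    base = f"{series_name}.{series_number}"
    if storage == "FITS":
        # One file per record; the variable part is optional for
        # univariate data.
        if record_number is None:
            raise ValueError("FITS protocol needs a record number")
        middle = f".{variable_name}" if variable_name else ""
        return f"{base}{middle}.{record_number}.fits"
    if storage == "FITS_MERGE":
        return f"{base}.fits"
    if storage == "CDF":
        return f"{base}.cdf"
    raise ValueError(f"unrecognized storage protocol: {storage!r}")


# "xxx" mirrors the placeholder first field used in the list above;
# the series name, number, and record number are made up.
per_record = data_filename("xxx.FITS", "example_series", "000012", "0003")
merged = data_filename("xxx.FITS_MERGE", "example_series", "000012")
```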
Since the descriptors of the Dataset are located in the Overview file and the Record information files, only the actual data values should be taken from the Data record files. The Data record files contain the data storage type (e.g. integer, float, etc) and the dimension information for images, data vectors, etc. as well as the data.
For a description of how to make and use SSSC Datasets, see the How_To document.