The SOI Data Storage and Distribution System (DSDS) consists of an Oracle database, a large disk storage area, tape systems, offline tape storage, and software to manage Datasets and control access to them. There are several key database tables used to manage Datasets. These include:
ds_naming
epoch
dataset_main
All DataSeries that will be used in "automatic" pipeline processing must have certain entries in the ds_naming table and the epoch table. These presently include the prog_name, prog_number, level_name, and series_name components of the Dataset name, and the values for the Overview File keywords CONFORMS, SER_TYP, T_BLOCK, EXT_INFO, and T_EPOCH.
At present these entries must be made by the DSDS operator. At some future date a form interface will be available to maintain these entries.
A DataSeries is a collection of Datasets which taken as a whole constitute a particular data product, spanning possibly a large interval of time and possibly many hundreds of gigabytes of storage. A DataSeries is a sequence of Datasets where the sequence index-axis is taken to be a "time" axis. Successive Datasets, identified by the Series_Number, contain the data for particular intervals along the DataSeries sequence index-axis.
For example, the DataSeries may be all MDI Calibrated Dopplergrams from the 2-month 1996 Dynamics interval. A Dopplergram is obtained each minute, so the roughly 60 days comprise 86400 Dopplergrams of typically 2 Mbytes each. This DataSeries is too large for most computers to handle as a single entity. The DataSeries is split, or blocked, into 1440 separate Datasets each containing a one-hour interval of 60 Dopplergrams. The 1440 Datasets are numbered and identified by the Series_Number part of the SSSC Dataset name. Thus the Series_Number is really just a position on the time axis. Each Dataset is a group of 60 records, so the record axis is just an interval on the Series_Number axis.
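The blocking arithmetic above can be sketched as follows (an illustrative fragment, not a DSDS API; the function names are hypothetical):

```c
/* Illustrative only (not a DSDS API): the arithmetic of hourly
 * blocking for 1-minute Dopplergrams. */
#define RECORDS_PER_DATASET 60   /* one-hour blocks of 1-minute data */

/* Position on the Series_Number axis for minute m of the interval;
 * e.g. minute 605 falls in Dataset 10 (the eleventh hour). */
int dataset_index(int minute) { return minute / RECORDS_PER_DATASET; }

/* Record number within that Dataset; e.g. minute 605 is record 5. */
int record_index(int minute) { return minute % RECORDS_PER_DATASET; }
```

With 60 days of data, 60 * 24 * 60 = 86400 minutes divide into exactly 86400 / 60 = 1440 Datasets.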
A standard naming protocol has been developed to identify DataSeries and the Datasets that comprise them. Dataset names consist of six parts, three names and three numbers. They are the prog_name, prog_number, level_name, level_number, series_name, and series_number. See the Dataset naming document for more details.
The prog_name part of a Dataset name specifies the project or program that the Dataset is associated with. There are a number of recognized SSSC "programs" including:
mdi          MDI flight datasets
mdi_ground   Test MDI data obtained prior to April 1994
lowl         Data from the HAO Lowl instrument
gong         Data from the GONG program
wso          Data from WSO
The purpose of the "prog" classification is to allow grouping of dataset descriptions and to allow co-location of associated data. The prog_name is also used to identify the unix environment variable of the same name which contains the rule or template used to map Dataset names onto the filesystem. A different mapping is allowed for each "prog" for each user environment. This allows use of SSSC analysis modules in the shell environment for program development and debugging purposes. It also allows use of the SSSC analysis modules outside the SSSC itself.
See also the Dataset naming style and Dataset naming document.
There is currently no standard use of the prog_number Dataset name element. All existing datasets have a prog_number set to "-1".
See also the Dataset naming document.
The level_name Dataset name element is used to describe the reduction level of the data. The normal processing stream for MDI data includes:
raw      MDI telemetry data as received at SSSC
merge    MDI telemetry data merged into contiguous chunks
lev0     MDI data grouped by hour and by DPC (Data Product Code)
lev1     MDI calibrated data by hour and DPC
lev1.5   MDI data with ancillary data, by hour and observable
lev2     MDI helioseismology or other intermediate data products
There are many other standard level_names in use. A more detailed discussion of the naming style is available, as is a list of the current names in use.
See also the Dataset naming document.
The level_number Dataset name element is used to encode the Dataset version number. Due to program bugs, calibration versions, input data revisions, etc., it may be necessary to generate a particular Dataset more than once. Since archived data in the DSDS is "write-once, read-many" the Dataset name must change each time a new Dataset is created. Full Dataset names are unique.
To allow the reuse of the logical Dataset name we have reserved the "level_number" name part to define the processing version number for the "Dataset" as defined by the remaining 5 name elements. The level_number is handled by the DSDS as follows:
If the level_number is explicitly specified as "0" the Dataset is considered temporary and is not archived. Subsequent creation of the same Dataset, with level_number 0, will overwrite any existing Dataset with level_number 0.
If the level_number is not specified on access of an existing Dataset, the Dataset with the largest level_number will be returned by the DSDS.
If the level_number is not specified on Dataset creation, the Dataset will be entered into the DSDS database with the next larger level_number.
In this way, testing and development can proceed without generating large amounts of archived junk and normal processing can proceed without detailed user attention to the level number.
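The three rules above can be sketched as a pair of hypothetical helper functions (not the DSDS implementation; versions[] stands in for the set of archived level_numbers of the logical Dataset):

```c
#define LEVEL_UNSPECIFIED (-1)

/* Hypothetical sketch of the level_number rules, not DSDS code.
 * versions[] holds the archived level_numbers for a logical Dataset. */

/* On access: an unspecified level_number selects the largest version. */
int level_for_access(const int *versions, int n, int requested)
{
    if (requested != LEVEL_UNSPECIFIED)
        return requested;
    int max = 0;
    for (int i = 0; i < n; i++)
        if (versions[i] > max)
            max = versions[i];
    return max;
}

/* On creation: level_number 0 is temporary and overwritten in place;
 * an unspecified level_number becomes the next larger version. */
int level_for_create(const int *versions, int n, int requested)
{
    if (requested == 0)
        return 0;
    if (requested != LEVEL_UNSPECIFIED)
        return requested;
    int max = 0;
    for (int i = 0; i < n; i++)
        if (versions[i] > max)
            max = versions[i];
    return max + 1;
}
```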
See also the Dataset naming document.
The series_name Dataset name element is close to what a user would normally think of as the dataset name. The series_name should identify the data quantity and it should provide an indication of the blocking along the Dataset series_number index axis, where appropriate. For example, MDI level-0 and level-1 data products have series_names formed by concatenating the MDI Data Product Code (DPC) and the time blocking, "01h" for hourly blocked data. Thus the series_name might be: ffffff04_01h for 1024x1024 filtergrams.
In addition to forming an important part of the logical name of the Dataset, the series_name usually forms an important part of the actual file names. Within the DSDS/PE pipeline environment there can be only one Dataset in one filesystem directory. The directory is selected by the DSDS. The filenames of the actual files that constitute the Dataset are formed from standardized rules using a template unique for each prog_name. Each file containing part of the data part of the Dataset is built by appending extensions to a "basename". The basename is usually built from the series_name and series_number. For this reason, the series_name should be built from characters that form reasonable file names. Datasets which meet certain content and naming conventions are called Conforming Datasets.
See also the Dataset naming document.
The series_number Dataset name element selects a particular Dataset of a DataSeries. For Conforming Datasets the series_number is an index along a time or time-like index axis. The nominal start time for data in a Dataset with series_number N is:
T_START = T_EPOCH + N * T_BLOCK
This convention allows the PE mapfile generation programs to compute the series_numbers that contain data for a date or range of dates (and times). DataCollections are often specified by a DataSeries name with a list or range of series_numbers.
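The relation and its inverse can be written as a short sketch. For simplicity, times here are plain seconds relative to an arbitrary origin, which is an assumption for illustration; real T_EPOCH values are formatted time strings:

```c
/* Sketch of the series-axis convention; times are plain seconds
 * relative to an arbitrary origin (an assumption for illustration;
 * real T_EPOCH values are formatted time strings). */

/* Nominal start time of the Dataset with series_number n:
 * T_START = T_EPOCH + N * T_BLOCK. */
double t_start(double t_epoch, double t_block, long n)
{
    return t_epoch + n * t_block;
}

/* series_number of the Dataset containing time t (t >= t_epoch);
 * this is the computation a mapfile generator needs to turn a date
 * range into a range of series_numbers. */
long series_number_for(double t_epoch, double t_block, double t)
{
    return (long)((t - t_epoch) / t_block);
}
```

For hourly blocking (T_BLOCK = 3600 seconds), hour 5 of day 10 gives series_number 10 * 24 + 5 = 245, as in the examples later in this document.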
See also the Dataset naming document.
The CONFORMS specification for a SSSC Dataset defines the overall structure of the DataSeries of which the Dataset is a part.
A Dataset is a collection of records. The index to the records may in general be considered as a record_number axis. For Conforming Datasets, the record_number axis is generally an interval on the DataSeries axis. The CONFORMS keyword specifies the relation between Series_Numbers, record numbers, and an underlying physical "time" axis.
The "time" axis is often normal time but may be any indexable axis. The SER_TYP keyword defines the type and particular instance of the Series_Number/Record axis. The SER_TYP keyword contains the name of an SSSC standard "time" axis. There is an SSSC Database table which contains the T_EPOCH for each registered SER_TYP. T_EPOCH contains a string which consists of two parts separated by the final underscore "_" character in the string. The string fragment after the underscore is the "time" axis type. The string fragment before the underscore is the "time" axis reference value.
For example, the SER_TYP for the MDI Doppler data is "t_obs" which might specify T_EPOCH to be 1995.11.23_00h:00m:00s_UT so the MDI Doppler Series axis is UT time referenced to the day of SOHO launch.
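The split at the final underscore can be sketched as follows (an illustrative helper, not DSDS code):

```c
#include <string.h>

/* Sketch (not DSDS code): splitting a T_EPOCH string at its final
 * underscore into the "time"-axis reference value and the axis type. */

/* Copies the reference value into ref and returns a pointer to the
 * axis-type fragment, or NULL if no underscore is present. */
const char *split_t_epoch(const char *t_epoch, char *ref, size_t reflen)
{
    const char *us = strrchr(t_epoch, '_');
    if (us == NULL)
        return NULL;                 /* malformed: no underscore */
    size_t n = (size_t)(us - t_epoch);
    if (n >= reflen)
        n = reflen - 1;              /* truncate to fit the buffer */
    memcpy(ref, t_epoch, n);
    ref[n] = '\0';                   /* reference value, e.g. the date */
    return us + 1;                   /* axis type, e.g. "UT" */
}
```

Applied to "1995.11.23_00h:00m:00s_UT" this yields the axis type "UT" and the reference value "1995.11.23_00h:00m:00s".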
For further information see the Conforming Dataset document.
The DATAKIND keyword specifies the type of data stored in the Dataset. It is used only in an advisory manner. Some programs which can only process a particular structure of input data will test the value of DATAKIND to ensure that the processing can be completed. The programmer can also use the DATAKIND value to indicate the list of keyword-value attributes that are expected for that particular type of data. For most standard SSSC DATAKINDs there is a "man page" enumerating the keywords expected in either the Overview file or in the per-record attribute files. For example, if the Dataset contains level-2 input Dopplergrams the Overview file will contain the line:
DATAKIND='L2_VELOCITY'
For further information see the Conforming Dataset document.
The PROTOCOL keyword specifies the external storage protocols used for the per-record keyword-value attributes and for the data records in an SSSC Dataset. Typically the per-record information is stored in /rdb format files and the data records are stored as FITS files. In that case the Dataset Overview file will contain the line:
PROTOCOL='RDB.FITS'
For further information see the Conforming Dataset document.
An SSSC Analysis Module (a.k.a. Strategy Module) usually requires one or more input SSSC Datasets which will be used by the module and usually creates one or more output Datasets. For each input purpose there will be an input ARGLIST specification of the type: ARG_DATA_IN. For each output purpose there will be an output ARGLIST specification of the type: ARG_DATA_OUT.
For example, the Doppler calibration module for MDI level-1 processing requires both input data to be calibrated and calibration tables appropriate for that data type. The ARGLIST specification found in src/modules/dopcal.c is:
{ARG_DATA_IN, "in", "", "", ""},
{ARG_DATA_IN, "dopcalib", "", "", ""},
{ARG_DATA_OUT, "out", "", "", ""},
For example, if the input dataset is hour 5 of day 10 for MDI Data Product Code ffffff04, the input spec would be:
in=prog:mdi,level:lev0,series:ffffff04_01h[245]
Using the first version of calibration tables the calibration table spec would be:
dopcalib=prog:mdi_calib,level:doppler_cal,series:ffffff04[0]
and the output spec might be:
out=prog:mdi,level:lev1,series:ffffff04_01h[245]
If, in the above example, hours 5, 6, and 7 of MDI day 10 were to be processed, the input specification could be:
in=prog:mdi,level:lev0,series:ffffff04_01h[245,246,247]
or, equivalently:
in=prog:mdi,level:lev0,series:ffffff04_01h[245-247]
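Expanding such a list or range of series_numbers can be sketched as follows (an illustrative fragment, not DSDS code; it handles only the comma-list and "lo-hi" range forms shown above):

```c
#include <stdio.h>
#include <string.h>

/* Sketch (not DSDS code): expanding a series_number spec such as
 * "245-247" or "245,246,247" into explicit numbers.
 * Fills out[] (up to maxn entries) and returns the count. */
int expand_series_numbers(const char *spec, long *out, int maxn)
{
    char buf[256];
    int count = 0;
    strncpy(buf, spec, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        long lo, hi;
        if (sscanf(tok, "%ld-%ld", &lo, &hi) == 2) {
            for (long n = lo; n <= hi && count < maxn; n++)
                out[count++] = n;            /* a "lo-hi" range */
        } else if (sscanf(tok, "%ld", &lo) == 1 && count < maxn) {
            out[count++] = lo;               /* a single number */
        }
    }
    return count;
}
```

Both "245-247" and "245,246,247" expand to the same three series_numbers, which is why the two specifications above are equivalent.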
The Analysis Module must determine the number of Datasets passed to the module by accessing the KEYlist via a keyword formed by appending an "underscore n", (_n), to the ARGLIST parameter.
For example, the C fragment:
KEY *param; /* from module formal parameter list */
int number_in_datasets;
number_in_datasets = getkey_int(param,"in_n");
The Analysis Module has the responsibility for implementing a strategy for dealing with the DataCollection.
Output DataCollections specify a set of working directories where the output Datasets will be written by the module. Within the PE environment a size estimate will have been provided to PE and sufficient file space will have been assigned prior to invoking the module. In the shell environment the user must provide the space.
In the shell environment the binding between Dataset names and a working directory is provided by a rule or template taken from the shell environment variable named for the Dataset Project_Name (a.k.a. Program_Name, Prog_Name, "prog:" class). The user is free to use any of the six Dataset name elements to build a working directory pathname. The environment template consists of literal characters and Dataset name element placeholders enclosed in "{}" pairs. The template is also used to specify the format of the Dataset Basename which is used to build the actual filenames.
For example, a test run of the Doppler calibration might use a dataset in the directory "/scr24/phil/doptest/lev0/". With the template defined by the csh command script fragment:
set BASEDIR=/scr24/phil/doptest
set noglob;
setenv CALTEST "wd:$BASEDIR/{level}/;bn:{series}{#%d#series}"
the input specification would be:
in=prog:CALTEST,level:lev0,series:ffffff04_01h[245]
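The binding the CALTEST template above produces can be sketched as follows (a hypothetical helper, not the DSDS template engine; the {level} and {series} substitutions are hard-coded rather than parsed from the template string):

```c
#include <stdio.h>

/* Hypothetical sketch, not the DSDS template engine: the binding the
 * CALTEST template produces, with the {level} and {series}
 * substitutions hard-coded rather than parsed from the template. */
void bind_dataset(const char *basedir, const char *level,
                  const char *series, long series_number,
                  char *wd, size_t wdlen, char *bn, size_t bnlen)
{
    /* wd:$BASEDIR/{level}/  -> working directory */
    snprintf(wd, wdlen, "%s/%s/", basedir, level);
    /* bn:{series}{#%d#series} -> basename for the Dataset's files */
    snprintf(bn, bnlen, "%s%ld", series, series_number);
}
```

For the input specification above this yields the working directory "/scr24/phil/doptest/lev0/" and the basename "ffffff04_01h245", from which the actual filenames are built by appending extensions.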
Within the PE environment the template is formed in a similar manner but an additional pathname source, {dbase}, is available and is substituted with the pathname determined by the query to the DSDS database.