The SOI Data Storage and Distribution System (DSDS) consists of an Oracle database, a large disk storage area, tape systems, offline tape storage, and software to manage Datasets and control access to them. There are several key database tables used to manage Datasets. These include:
ds_naming
epoch
dataset_main
All DataSeries that will be used in "automatic" pipeline processing must have certain entries in the ds_naming table and the epoch table. These presently include the prog_name, prog_number, level_name, and series_name components of the Dataset name, and the values for the Overview File keywords CONFORMS, SER_TYP, T_BLOCK, EXT_INFO, and T_EPOCH.
At present these entries must be made by the DSDS operator. At some future date a form interface will be available to maintain these entries.
A DataSeries is a collection of Datasets which taken as a whole constitute a particular data product, spanning possibly a large interval of time and possibly many hundreds of gigabytes of storage. A DataSeries is a sequence of Datasets where the sequence index-axis is taken to be a "time" axis. Successive Datasets, identified by the Series_Number, contain the data for particular intervals along the DataSeries sequence index-axis.
For example, the DataSeries may be all MDI Calibrated Dopplergrams from the 2-month 1996 Dynamics interval. A Dopplergram is obtained each minute, so the roughly 60 days comprise 86400 Dopplergrams of typically 2 Mbytes each. This DataSeries is too large for most computers to handle as a single entity. The DataSeries is split, or blocked, into 1440 separate Datasets each containing a one-hour interval of 60 Dopplergrams. The 1440 Datasets are numbered and identified by the Series_Number part of the SSSC Dataset name. Thus the Series_Number is really just a position on the time axis. Each Dataset is a group of 60 records, so the record axis is just an interval on the Series_Number axis.
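The blocking arithmetic above can be sketched as follows (an illustrative fragment, not a DSDS API; the function names are hypothetical):

```c
/* Illustrative only (not a DSDS API): the arithmetic of hourly
 * blocking for 1-minute Dopplergrams. */
#define RECORDS_PER_DATASET 60   /* one-hour blocks of 1-minute data */

/* Position on the Series_Number axis for minute m of the interval;
 * e.g. minute 605 falls in Dataset 10 (the eleventh hour). */
int dataset_index(int minute) { return minute / RECORDS_PER_DATASET; }

/* Record number within that Dataset; e.g. minute 605 is record 5. */
int record_index(int minute) { return minute % RECORDS_PER_DATASET; }
```

With 60 days of data, 60 * 24 * 60 = 86400 minutes divide into exactly 86400 / 60 = 1440 Datasets.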
A standard naming protocol has been developed to identify DataSeries and the Datasets that comprise them. Dataset names consist of six parts, three names and three numbers. They are the prog_name, prog_number, level_name, level_number, series_name, and series_number. See the Dataset naming document for more details.
The prog_name part of a Dataset name specifies the project or program that the Dataset is associated with. There are a number of recognized SSSC "programs" including:
mdi          MDI flight datasets
mdi_ground   Test MDI data obtained prior to April 1994
lowl         Data from the HAO Lowl instrument
gong         Data from the GONG program
wso          Data from WSO
The purpose of the "prog" classification is to allow grouping of dataset descriptions and to allow co-location of associated data. The prog_name is also used to identify the unix environment variable of the same name which contains the rule or template used to map Dataset names onto the filesystem. A different mapping is allowed for each "prog" for each user environment. This allows use of SSSC analysis modules in the shell environment for program development and debugging purposes. It also allows use of the SSSC analysis modules outside the SSSC itself.
See also the Dataset naming style and Dataset naming document.
There is currently no standard use of the prog_number Dataset name element. All existing datasets have a prog_number set to "-1".
See also the Dataset naming document.
The level_name Dataset name element is used to describe the reduction level of the data. The normal processing stream for MDI data includes:
raw      MDI telemetry data as received at SSSC
merge    MDI telemetry data merged into contiguous chunks
lev0     MDI data grouped by hour and by DPC (Data Product Code)
lev1     MDI calibrated data by hour and DPC
lev1.5   MDI data with ancillary data, by hour and observable
lev2     MDI helioseismology or other intermediate data products
There are many other standard level_names in use. A more detailed discussion of the naming style is available, as is a list of the current names in use.
See also the Dataset naming document.
The level_number Dataset name element is used to encode the Dataset version number. Due to program bugs, calibration versions, input data revisions, etc., it may be necessary to generate a particular Dataset more than once. Since archived data in the DSDS is "write-once, read-many" the Dataset name must change each time a new Dataset is created. Full Dataset names are unique.
To allow the reuse of the logical Dataset name we have reserved the "level_number" name part to define the processing version number for the "Dataset" as defined by the remaining 5 name elements. The level_number is handled by the DSDS as follows:
If the level_number is explicitly specified as "0" the Dataset is considered temporary and is not archived. Subsequent creation of the same Dataset, with level_number 0, will overwrite any existing Dataset with level_number 0.
If the level_number is not specified on access of an existing Dataset, the Dataset with the largest level_number will be returned by the DSDS.
If the level_number is not specified on Dataset creation, the Dataset will be entered into the DSDS database with the next larger level_number.
In this way, testing and development can proceed without generating large amounts of archived junk and normal processing can proceed without detailed user attention to the level number.
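The three rules above can be sketched as a pair of hypothetical helper functions (not the DSDS implementation; versions[] stands in for the set of archived level_numbers of the logical Dataset):

```c
#define LEVEL_UNSPECIFIED (-1)

/* Hypothetical sketch of the level_number rules, not DSDS code.
 * versions[] holds the archived level_numbers for a logical Dataset. */

/* On access: an unspecified level_number selects the largest version. */
int level_for_access(const int *versions, int n, int requested)
{
    if (requested != LEVEL_UNSPECIFIED)
        return requested;
    int max = 0;
    for (int i = 0; i < n; i++)
        if (versions[i] > max)
            max = versions[i];
    return max;
}

/* On creation: level_number 0 is temporary and overwritten in place;
 * an unspecified level_number becomes the next larger version. */
int level_for_create(const int *versions, int n, int requested)
{
    if (requested == 0)
        return 0;
    if (requested != LEVEL_UNSPECIFIED)
        return requested;
    int max = 0;
    for (int i = 0; i < n; i++)
        if (versions[i] > max)
            max = versions[i];
    return max + 1;
}
```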
See also the Dataset naming document.
The series_name Dataset name element is close to what a user would normally think of as the dataset name. The series_name should identify the data quantity and it should provide an indication of the blocking along the Dataset series_number index axis, where appropriate. For example, MDI level-0 and level-1 data products have series_names formed by concatenating the MDI Data Product Code (DPC) and the time blocking, "01h" for hourly blocked data. Thus the series_name might be: ffffff04_01h for 1024x1024 filtergrams.
In addition to forming an important part of the logical name of the Dataset, the series_name usually forms an important part of the actual file names. Within the DSDS/PE pipeline environment there can be only one Dataset in one filesystem directory. The directory is selected by the DSDS. The filenames of the actual files that constitute the Dataset are formed from standardized rules using a template unique for each prog_name. Each file containing part of the data part of the Dataset is built by appending extensions to a "basename". The basename is usually built from the series_name and series_number. For this reason, the series_name should be built from characters that form reasonable file names. Datasets which meet certain content and naming conventions are called Conforming Datasets.
See also the Dataset naming document.
The series_number Dataset name element selects a particular Dataset of a DataSeries. For Conforming Datasets the series_number is an index along a time or time-like index axis. The nominal start time for data in a Dataset with series_number N is:
T_START = T_EPOCH + N * T_BLOCK
This convention allows the PE mapfile generation programs to compute the series_numbers that contain data for a date or range of dates (and times). DataCollections are often specified by a DataSeries name with a list or range of series_numbers.
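The relation and its inverse can be written as a short sketch. For simplicity, times here are plain seconds relative to an arbitrary origin, which is an assumption for illustration; real T_EPOCH values are formatted time strings:

```c
/* Sketch of the series-axis convention; times are plain seconds
 * relative to an arbitrary origin (an assumption for illustration;
 * real T_EPOCH values are formatted time strings). */

/* Nominal start time of the Dataset with series_number n:
 * T_START = T_EPOCH + N * T_BLOCK. */
double t_start(double t_epoch, double t_block, long n)
{
    return t_epoch + n * t_block;
}

/* series_number of the Dataset containing time t (t >= t_epoch);
 * this is the computation a mapfile generator needs to turn a date
 * range into a range of series_numbers. */
long series_number_for(double t_epoch, double t_block, double t)
{
    return (long)((t - t_epoch) / t_block);
}
```

For hourly blocking (T_BLOCK = 3600 seconds), hour 5 of day 10 gives series_number 10 * 24 + 5 = 245, as in the examples later in this document.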
See also the Dataset naming document.
The CONFORMS specification for a SSSC Dataset defines the overall structure of the DataSeries of which the Dataset is a part.
A Dataset is a collection of records. The index to the records may in general be considered as a record_number axis. For Conforming Datasets, the record_number axis is generally an interval on the DataSeries axis. The CONFORMS keyword specifies the relation between Series_Numbers, record numbers, and an underlying physical "time" axis.
The "time" axis is often normal time but may be any indexable axis. The SER_TYP keyword defines the type and particular instance of the Series_Number/Record axis. The SER_TYP keyword contains the name of an SSSC standard "time" axis. There is an SSSC Database table which contains the T_EPOCH for each registered SER_TYP. T_EPOCH contains a string which consists of two parts separated by the final underscore "_" character in the string. The string fragment after the underscore is the "time" axis type. The string fragment before the underscore is the "time" axis reference value.
For example, the SER_TYP for the MDI Doppler data is "t_obs" which might specify T_EPOCH to be 1995.11.23_00h:00m:00s_UT so the MDI Doppler Series axis is UT time referenced to the day of SOHO launch.
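The split at the final underscore can be sketched as follows (an illustrative helper, not DSDS code):

```c
#include <string.h>

/* Sketch (not DSDS code): splitting a T_EPOCH string at its final
 * underscore into the "time"-axis reference value and the axis type. */

/* Copies the reference value into ref and returns a pointer to the
 * axis-type fragment, or NULL if no underscore is present. */
const char *split_t_epoch(const char *t_epoch, char *ref, size_t reflen)
{
    const char *us = strrchr(t_epoch, '_');
    if (us == NULL)
        return NULL;                 /* malformed: no underscore */
    size_t n = (size_t)(us - t_epoch);
    if (n >= reflen)
        n = reflen - 1;              /* truncate to fit the buffer */
    memcpy(ref, t_epoch, n);
    ref[n] = '\0';                   /* reference value, e.g. the date */
    return us + 1;                   /* axis type, e.g. "UT" */
}
```

Applied to "1995.11.23_00h:00m:00s_UT" this yields the axis type "UT" and the reference value "1995.11.23_00h:00m:00s".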
For further information see the Conforming Dataset document.
The DATAKIND keyword specifies the type of data stored in the Dataset. It is used only in an advisory manner. Some programs which can only process a particular structure of input data will test the value of DATAKIND to ensure that the processing can be completed. The programmer can also use the DATAKIND value to indicate the list of keyword-value attributes that are expected for that particular type of data. For most standard SSSC DATAKINDs there is a "man page" enumerating the keywords expected in either the Overview file or in the per-record attribute files. For example, if the Dataset contains level-2 input Dopplergrams the Overview file will contain the line:
DATAKIND='L2_VELOCITY'
For further information see the Conforming Dataset document.
The PROTOCOL keyword specifies the external storage protocols used for the per-record keyword-value attributes and for the data records in an SSSC Dataset. Typically the per-record information is stored in /rdb format files and the data records are stored as FITS files. In that case the Dataset Overview file will contain the line:
PROTOCOL='RDB.FITS'
For further information see the Conforming Dataset document.
An SSSC Analysis Module (a.k.a. Strategy Module) usually requires one or more input SSSC Datasets which will be used by the module and usually creates one or more output Datasets. For each input purpose there will be an input ARGLIST specification of the type: ARG_DATA_IN. For each output purpose there will be an output ARGLIST specification of the type: ARG_DATA_OUT.
For example, the Doppler calibration module for MDI level-1 processing requires both input data to be calibrated and calibration tables appropriate for that data type. The ARGLIST specification found in src/modules/dopcal.c is:
{ARG_DATA_IN, "in", "", "", ""},
{ARG_DATA_IN, "dopcalib", "", "", ""},
{ARG_DATA_OUT, "out", "", "", ""},
For example, if the input dataset is hour 5 of day 10 for MDI Data Product Code ffffff04, the input spec would be:
in=prog:mdi,level:lev0,series:ffffff04_01h[245]
Using the first version of calibration tables the calibration table spec would be:
dopcalib=prog:mdi_calib,level:doppler_cal,series:ffffff04[0]
and the output spec might be:
out=prog:mdi,level:lev1,series:ffffff04_01h[245]
If, in the above example, hours 5, 6, and 7 of MDI day 10 were to be processed, the input specification could be:
in=prog:mdi,level:lev0,series:ffffff04_01h[245,246,247]
or, equivalently:
in=prog:mdi,level:lev0,series:ffffff04_01h[245-247]
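Expanding such a list or range of series_numbers can be sketched as follows (an illustrative fragment, not DSDS code; it handles only the comma-list and "lo-hi" range forms shown above):

```c
#include <stdio.h>
#include <string.h>

/* Sketch (not DSDS code): expanding a series_number spec such as
 * "245-247" or "245,246,247" into explicit numbers.
 * Fills out[] (up to maxn entries) and returns the count. */
int expand_series_numbers(const char *spec, long *out, int maxn)
{
    char buf[256];
    int count = 0;
    strncpy(buf, spec, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        long lo, hi;
        if (sscanf(tok, "%ld-%ld", &lo, &hi) == 2) {
            for (long n = lo; n <= hi && count < maxn; n++)
                out[count++] = n;            /* a "lo-hi" range */
        } else if (sscanf(tok, "%ld", &lo) == 1 && count < maxn) {
            out[count++] = lo;               /* a single number */
        }
    }
    return count;
}
```

Both "245-247" and "245,246,247" expand to the same three series_numbers, which is why the two specifications above are equivalent.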
The Analysis Module must determine the number of Datasets passed to the module by accessing the KEYlist via a keyword formed by appending an "underscore n", (_n), to the ARGLIST parameter.
For example, the C fragment:
KEY *param; /* from module formal parameter list */
int number_in_datasets;
number_in_datasets = getkey_int(param,"in_n");
The Analysis Module has the responsibility for implementing a strategy for dealing with the DataCollection.
Output DataCollections specify a set of working directories where the output Datasets will be written by the module. Within the PE environment a size estimate will have been provided to PE and sufficient file space will have been assigned prior to invoking the module. In the shell environment the user must provide the space.
In the shell environment the binding between Dataset names and a working directory is provided by a rule or template taken from the shell environment variable named for the Dataset Project_Name (a.k.a. Program_Name, Prog_Name, "prog:" class). The user is free to use any of the six Dataset name elements to build a working directory pathname. The environment template consists of literal characters and Dataset name element placeholders enclosed in "{}" pairs. The template is also used to specify the format of the Dataset Basename which is used to build the actual filenames.
For example, a test run of the Doppler calibration might use a dataset in the directory "/scr24/phil/doptest/lev0/". With the template defined by the csh command script fragment:
set BASEDIR=/scr24/phil/doptest
set noglob;
setenv CALTEST "wd:$BASEDIR/{level}/;bn:{series}{#%d#series}"
the input specification would be:
in=prog:CALTEST,level:lev0,series:ffffff04_01h[245]
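The binding the CALTEST template above produces can be sketched as follows (a hypothetical helper, not the DSDS template engine; the {level} and {series} substitutions are hard-coded rather than parsed from the template string):

```c
#include <stdio.h>

/* Hypothetical sketch, not the DSDS template engine: the binding the
 * CALTEST template produces, with the {level} and {series}
 * substitutions hard-coded rather than parsed from the template. */
void bind_dataset(const char *basedir, const char *level,
                  const char *series, long series_number,
                  char *wd, size_t wdlen, char *bn, size_t bnlen)
{
    /* wd:$BASEDIR/{level}/  -> working directory */
    snprintf(wd, wdlen, "%s/%s/", basedir, level);
    /* bn:{series}{#%d#series} -> basename for the Dataset's files */
    snprintf(bn, bnlen, "%s%ld", series, series_number);
}
```

For the input specification above this yields the working directory "/scr24/phil/doptest/lev0/" and the basename "ffffff04_01h245", from which the actual filenames are built by appending extensions.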
Within the PE environment the template is formed in a similar manner but an additional pathname source, {dbase}, is available and is substituted with the pathname determined by the query to the DSDS database.