SOI Data Product Quality Logs

SOI TN 97-138
R S Bogart
1997.04.11

Introduction

Information concerning the availability and quality of various kinds of data should always be accessible, both interactively, to guide users in decisions regarding analysis and processing, and during program execution, to enable automated selection procedures to be invoked. This document describes the files and interface procedures required to provide such information.

The different purposes to which information on data quality and availability can be put dictate different requirements on how and where the information is kept. If the information is to be used to select particular datasets for examination, processing, or analysis, then that information must be kept in a separate location from the data themselves. On the other hand, quality parameters that are associated with the data as an integral part of any delivered or distributed data product are most efficiently kept within the dataset itself. (It is theoretically possible to incorporate quality information from outside as a step in data distribution, but this would be complex and might have major ramifications for the operation of the DSDS.)

Recall that a dataset is equivalent to a directory containing files corresponding to individual data, images, sequences, etc., and that the dataset is the atomic unit at which the DSDS database is organized. The typical full SOI dataset contains 60 to 1440 images (files), each of which must be individually characterized as to data quality. It is therefore not possible to maintain this information in the relational database as currently organized; conceivably a field containing a maximal set of status and quality information words for all images in the dataset could be added to each record.

This note lays out an alternate plan, in which the availability and quality information is kept in ancillary (flat) files, either within or outside the individual datasets.

The Log Files

The various data information sources required may be organized as follows:
SOHO Status Log
a mission-long log providing information on the status of the SOHO spacecraft at any time. Most likely maintained as an event/transition list.

Instrument Status Log
a mission-long log providing information on the basic MDI instrument status and configuration at any time. The log consists of a set of state codes as a function of time, as described in the format section below. The instrument status is maintained in an appendable dataset, beginning with mission day 1065 (1995.12.02_00:00:00_TAI, start of day of SOHO launch).

Observing Logs (3)
There are three mission-long observing logs that record the IP data products available for each minute at various levels of processing: raw, level 0, and level 1.

Data Quality Logs (several)
individual mission-long data quality logs for various specific observables containing appropriate status and quality flag values. Separate data quality logs organized as time series with appropriate cadences should be provided for at least:

Summary Quality Files
Associated with and eventually contained in each dataset there should be one or more files providing a combination of status and quality information reflecting the contents of the appropriate logs, on a per-image basis.

File Locations, Names

The SOHO Status Log does not exist at this time.

The Instrument Status Log files are in the directory /soidata/info/mdi_log/lev0/MDI_log_01d, a link to space under DSDS control. Individual daily files are named XXXXXX.record.rdb, where XXXXXX is the mission day number. For each daily file there is an associated file XXXXXX.overview.fits. The directory contains one additional file, history.txt, an ASCII text history of updates to the logs.

All Observing and Data Quality Logs reside on the virtual directory /home/soi/logs. (This is currently a link to the physical directory /soidata/files/logs.) The observing logs are named:

obslog0
the per-minute log of available level 0 data products, organized by Reference Time (T_REF).
obslog1
the per-minute log of available level 1 data products, organized by Observation Time (T_OBS).
The raw logs remain to be specified. A sample of obslog0 covering a period of 454d 13h 36m starting Jan. 1, 1996 exists on /scr30/logs/96log.

The Data Quality Logs are named:

qual_fdV
data quality parameters for full-disc Dopplergrams, indexed by mission minute.
qual_fdIc
data quality parameters for full-disc Continuum photograms, indexed by mission minute.
qual_fdLd
data quality parameters for full-disc Line Depth grams, indexed by mission minute.
qual_fdM
data quality parameters for full-disc Magnetograms, indexed by mission minute.
qual_limb
data quality parameters for level 1 Limb Continuum data, indexed by 12-minute intervals.
qual_rwIc
data quality parameters for 128*128 binned Continuum photograms, indexed by 12-minute intervals.
qual_rwLd
data quality parameters for 128*128 binned Line Depth photograms, indexed by 12-minute intervals.
File names and organization for other observables remain to be specified.

The Summary Quality Files are included in Level 1.4/5 (?) datasets as described in the man page for the module gather_qual.

Path names, structure definitions, constants, and function prototypes specific to writing and reading, maintenance and interpretation of the Data Quality Logs are contained in the include file /CM/include/data_qual.h. Required functions are in the library /CM/lib/_machtype/libMDI.a.

Format

The Instrument Status Log is a collection of RDB files each containing 7 fields and 1440 records, one record per minute; each record is 93 bytes long. (Exception: the file for mission day 1066 contains 8 fields and each record is 113 bytes long.) The fields are:
  1. I_DREC : the mission minute number of the record
  2. DATAFILE : always blank
  3. T_REC : date_time string corresponding to the mission minute
  4. T_OBS : always blank
  5. T_REF : same as T_REC
  6. MDI_SW : a 10-character string giving the hexadecimal representation of the 32-bit MDI Status Word
  7. MDI_SSW : a 10-character string giving the hexadecimal representation of the 32-bit MDI Status Status Word
  8. NOTE : blank for every record except 1535831, which contains the string "MDI power turned on"; this field appears only in the file for mission day 1066.
(Apart from the entries in the history of updates, one day's worth of instrument status consumes 134106 bytes for the RDB file plus 2880 bytes for the FITS file, or 136986 bytes; this is nearly 12 times what a single direct-access binary file containing just the Status Word and the Status Status Word, with record numbers implicit in file location, would consume for a day: 1440 records x 8 bytes = 11520 bytes!)
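The daily-file organization above implies a fixed mapping between mission-minute numbers (I_DREC) and daily file numbers. A minimal sketch, assuming mission minutes are counted from 00:00 of mission day 0 (the function names are illustrative, not part of libMDI):

```c
#define MINUTES_PER_DAY 1440

/* Mission day containing a given mission minute (I_DREC), assuming
   mission minutes are counted from 00:00 of mission day 0. */
long mission_day (long minute)
{
    return minute / MINUTES_PER_DAY;
}

/* Record number of that minute within its daily file. */
long minute_of_day (long minute)
{
    return minute % MINUTES_PER_DAY;
}
```

Under this mapping the "MDI power turned on" record 1535831 falls in the file for mission day 1066, minute 791 of the day, consistent with the extra NOTE field appearing only in the day-1066 file.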

The MDI Status Word and MDI Status Status Word are described in Tech Note 96-135.

The level-0 and level-1 Observing Logs are both binary files with 24-word "records", each record corresponding to one minute of time starting with 1996.01.01_00:00:00_TAI. Each record consists of a set of 4-byte words corresponding to all the DPC's whose Reference Times (for Level 0 data) or calculated Observing Times (for Level 1 data) fall within that minute. The records are padded with nulls so that the first null signifies the end of known data products for that minute. The order of entries within a record is immaterial. When multiple instances of a data product with the same code occur in the same minute, as may occur during high-cadence observing, only a single entry is made in the log. Separate higher-resolution logs may be produced for such data products on a case-by-case basis.
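The record layout above can be sketched in C as follows; the names obslog_offset and obslog_record_has_dpc are illustrative only, not part of libMDI:

```c
#define OBSLOG_WORDS   24                     /* 4-byte words per minute record */
#define OBSLOG_RECSIZE (OBSLOG_WORDS * 4)     /* 96 bytes per record */

/* Byte offset of the record for a given minute index
   (minutes elapsed since 1996.01.01_00:00:00_TAI). */
long obslog_offset (long minute)
{
    return minute * (long) OBSLOG_RECSIZE;
}

/* Scan one minute record for a DPC; the first null word marks the
   end of the known data products for that minute, and the order of
   entries is immaterial. */
int obslog_record_has_dpc (const unsigned int rec[OBSLOG_WORDS],
                           unsigned int dpc)
{
    int i;
    for (i = 0; i < OBSLOG_WORDS; i++) {
        if (rec[i] == 0) return 0;            /* null padding: list ended */
        if (rec[i] == dpc) return 1;
    }
    return 0;
}
```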

The per-observable Quality Logs are binary files with 2-word "records", each record corresponding to one observing interval appropriate to the observable. The first word of a record is the quality status word, and the second word is the quality status status word, detailing the validity of the corresponding bits in the quality status word. The meaning of individual fields in the quality status word is described in the Data Quality Parameters section below.
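A sketch of how a reader might address and interpret one such record (the function names are illustrative, not part of libMDI):

```c
/* Byte offset of the 2-word (8-byte) quality record for a given
   observing-interval index. */
long quallog_offset (long index)
{
    return index * 8L;
}

/* A quality bit is meaningful only if the matching bit of the
   quality status status word is set. */
int qual_bit_valid (unsigned int status, int bit)
{
    return (int) ((status >> bit) & 1u);
}

/* Value of a quality bit; callers should check qual_bit_valid()
   first, since an unvalidated bit carries no information. */
int qual_bit (unsigned int quality, int bit)
{
    return (int) ((quality >> bit) & 1u);
}
```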

Maintenance

The Instrument Status Log is updated by the module add2mdilog.

The level-0 and level-1 Observing Logs are both populated and updated by the module add2obslog, from per-record information available in the appropriate permanent appendable datasets. They should be updated whenever the corresponding level of processing is performed on any dataset. The module is only capable of adding entries to the log, however; a separate routine would be needed if it ever became necessary to withdraw an entry set by erroneous processing. To my knowledge such a case has never occurred.
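The add-only behavior described above might reduce to the following per-record operation (a sketch, not the module's actual code): scan the record for the DPC, and append it at the first null word only if it is absent, which is why re-running the module on the same dataset is harmless.

```c
#define OBSLOG_WORDS 24

/* Add a DPC to a minute record unless it is already present.
   Returns 1 if added, 0 if already present, -1 if the record is full. */
int obslog_add_dpc (unsigned int rec[OBSLOG_WORDS], unsigned int dpc)
{
    int i;
    for (i = 0; i < OBSLOG_WORDS; i++) {
        if (rec[i] == dpc) return 0;   /* already logged: idempotent */
        if (rec[i] == 0) {             /* first null word: append here */
            rec[i] = dpc;
            return 1;
        }
    }
    return -1;                         /* no room left in this record */
}
```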

Data Quality Parameters

The Data Quality Logs for each observable contain certain bits of information common to all observables, plus additional bits specific to particular data products. Common data quality parameters are encoded in the following bits of the quality word (a bit is set to indicate the truth of the corresponding statement):

The corresponding bits in the quality status word are:

In general, an "ideal" observation would have most of the bit fields set, but this is not true for all of the fields, notably 7 - 9. A few notes of explanation:
  1. Bits 2 - 4 form a numeric pattern corresponding to the fraction of expected data values valid:
  2. A number of level 1 data products have incorrect statistics in their headers, due to processing errors presumably associated with scaling. Although these can and should be fixed by correcting the bugs and reprocessing, until that is done the header values cannot be reliably used for examining image statistics. Information about the statistics validity is included in bit 5.
  3. There is a non-gaussian tail to the distribution of values for certain observables, possibly due to cosmic-ray hits; this tail typically affects about 0.005% of the pixels in full-disc Dopplergrams. Information about whether this tail has been examined, and whether individual statistically unlikely pixels have been flagged in separate files, is included in bits 7 & 8.
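Assuming the bit positions described above (a 3-bit fraction-valid field at bits 2 - 4, statistics validity at bit 5, and the tail-examination flags at bits 7 and 8), the common fields could be extracted as follows; the macro names and mask values are illustrative, not taken from data_qual.h:

```c
#define QUAL_FRAC_SHIFT   2
#define QUAL_FRAC_MASK    0x7u        /* bits 2-4: fraction-valid pattern */
#define QUAL_STATS_OK     (1u << 5)   /* bit 5: header statistics valid */
#define QUAL_TAIL_CHECKED (1u << 7)   /* bit 7: value distribution examined */
#define QUAL_PIX_MARKED   (1u << 8)   /* bit 8: unlikely pixels marked */

/* Numeric pattern (0-7) encoding the fraction of expected data
   values valid, from bits 2-4 of the quality word. */
unsigned int qual_fraction_code (unsigned int quality)
{
    return (quality >> QUAL_FRAC_SHIFT) & QUAL_FRAC_MASK;
}
```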

Data quality parameters specific to individual data products should include:

  1. Data origin, e.g.:
  2. Other observable-specific quality parameters TBD
Bit fields specifically appropriate to the full-disc fdV, fdIc, and fdLd observables are as follows (the corresponding bits in the status word should be clear):

Library Function Specifications

The following or similar functions are suggested for manipulation of the observing and quality logs. Those marked with an asterisk have actually been implemented as functions in libMDI.
#include <data_qual.h>

unsigned int *get_data_products (TIME t, int level)
returns a malloc'd list of all available data products at the given processing level within the mission minute including the selected time.
int data_product_exists (unsigned int dpc, TIME t, int level)*
returns true if the selected data product is available at the given processing level within the mission minute including the selected time.
int data_product_exists_near (unsigned int dpc, TIME t, double delta, int level)*
returns true if the selected data product is available at the given processing level within the selected range of the selected time.
int data_exist (int datatype, TIME t, int level)*
returns true if any data products corresponding to the selected type are available at the given processing level within the mission minute including the selected time.
int update_data_quality (int observable, int index, unsigned int quality, unsigned int status, int replace, int timeout)*
updates or replaces the quality and status words at the indexed location for the corresponding observable. (A timeout is included in case of file locking problems.)
unsigned int data_quality_flag (int datatype, TIME t)
returns a 32-bit flag value that can be masked with various constants to determine the processing status and data quality for data of the selected type whose observed interval includes the selected time.
unsigned int MDI_configuration (TIME t)
returns a 32-bit flag value that can be masked with various constants to determine the MDI observing configuration and state at the selected observing time.
int clear_data_quality (int datatype, TIME t, int param)
sets the appropriate bit(s) for the selected parameter to indicate that the parameter is not valid (param_invalid) within the data quality flag for the given data product during the mission minute including the selected time.
int set_data_quality (int datatype, TIME t, int param, int value)
sets the appropriate bit(s) for the selected parameter to the selected value (and to param_valid) within the data quality flag for the given data product during the mission minute including the selected time.
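The set/clear pair above implies a read-modify-write of the quality and status words; a minimal sketch of the core bit manipulation, assuming each parameter occupies a single bit at the same position in both words (the function names are illustrative):

```c
/* Set a single-bit parameter to the given value and mark it valid
   in the status word (sketch of set_data_quality's core operation). */
void qual_set_param (unsigned int *quality, unsigned int *status,
                     int bit, int value)
{
    if (value) *quality |=  (1u << bit);
    else       *quality &= ~(1u << bit);
    *status |= (1u << bit);            /* param_valid */
}

/* Mark a parameter not valid (sketch of clear_data_quality). */
void qual_clear_param (unsigned int *status, int bit)
{
    *status &= ~(1u << bit);           /* param_invalid */
}
```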

Additional Library Functions

The following could be implemented as macros testing the data quality flag. Those marked with an asterisk have actually been implemented as functions in libMDI. There are several other sample functions included there that are not described here. Some of these functions are already in use by production modules.
#include <data_qual.h>

stats_valid (int datatype, int level, int minute)*
returns STATS_VALID if the statistics are known valid, STATS_INVALID if the statistics are known invalid, and STATS_NOT_VALIDATED if the header statistics have not been verified.
excess_badpix (int datatype, int level, int minute)*
returns BADPIX_EXCESSIVE if the number of statistically unlikely values exceeds the maximum number tabulated per image, BADPIX_NOT_CHECKED if the bad pixels have not been tabulated, and the number of bad pixels otherwise.
OKforFDDop (int minute)*
returns 0 if the statistics are known invalid or if there is a framelist error for the Full-Disc Dopplergram for the minute; 1 otherwise. (It would be better to have this function return 1 only if the data were known to be acceptable, and to have another function that would return 1 if the data were known to be unacceptable.)

Querying the Logs

Appendix I: Notes

I have a mission observing log (on /scr30/logs/96log at the moment).
It is a binary file with 24-word "records", each record corresponding
to one minute of time starting with 1996.01.01_00:00.  The contents of
each record are a set of 4-byte words corresponding to all the DPC's
with (calculated) OBS_TIME's (not REF_TIME's) within that minute,
plus padding nulls.  (I don't think we ever have more than 24 DPC's
in a minute.)  It's obviously easy to seek into this file to come up
with a list of DPC's observed during (or within a specified distance from)
a particular minute, or to scan through it to produce a set of observing
times corresponding to any collection of DPC's.  I have a few very simple
functions written, like

unsigned int *get_data_products (TIME t)
int data_product_exists (unsigned int dpc, TIME t)
int data_exist (int datatype, TIME t)

(see ~rick/soi/qual/continuous.c)

Currently the log is populated with a selected set of DPC's corresponding
to all the standard 5k data products and all the standard full-disk
data products.  It can be (re-)populated for any given DPC with a module
(~rick/soi/qual/fill_obslog.c) that can take either a dataset name or the
name of the record.rdb file in the dataset as arguments, so it can be used
quickly from the files on /soidata/info/mdi_rec.  The log is populated from
the level 0 headers: it is intended to be an original source of information
about DPC's.  How far the corresponding data have been processed should be
provided as part of a quality flag in a separate set of files.  Ideally the
observing log would be populated in the dosciXk processing scripts immediately
ahead of the cpinfo calls.  There is no harm in rerunning the module with the
same input dataset multiple times, and the order in which the DPC's are added
is immaterial.

I am leaning to another collection of binary files (again organized on a one
per minute or other appropriate cadence basis), one for each major observable
(e.g. full-disk Doppler, Limb continuum).  For each minute there would be a
single word (of length TBD, but 32 bits is probably sufficient) describing
the level of processing and various statistical data quality parameters, plus
corresponding validity bits.  This would be like the general quality flag that
Rock and I developed, but whereas that one (which I believe is being
implemented) focuses on instrument state, this one would use parameters
appropriate to individual observables.  It would likely be predominantly
filled at or immediately after the level 1 processing, but with some filling
at both level 0 and gather, and maybe even later.  Here is a sketch of a
pattern for say FDV, FDC, & FDL:

0:  60-sec "standard" DPC exists
1:  30-sec "standard" DPC exists in upper half of minute
2:  60-sec nonstandard DPC exists
3:  30-sec nonstandard DPC exists in upper half of minute
4:  30-sec DPC exists in lower half of minute
5:  data calibrated
6:  level 1 statistics validated
7:  > 0 values valid
8:  > 10% valid
9:  > 50% valid
a:  > 90% valid
b:  no missing values
c:  A < min < B  (x sigma in chi-squared, adjusted for expectation)
d:  C < max < D  (")
e:  E < mean < F (")
f:  instrument configuration "nominal"
10: instrument configuration inspected
11: data gathered to level 1.5
12: level 1.5 data reassembled from higher-resolution data
...