SOI Data Product Quality Logs
SOI TN 97-138
R S Bogart
1997.04.11
Introduction
Information concerning the availability and quality of various kinds of
data should always be available both interactively to guide users in
decisions regarding analysis and processing and during program execution
to enable automated selection procedures to be invoked. This document
describes the various files and interface procedures required to provide
such information.
The different purposes to which information on data quality and availability
can be put dictate different requirements on how and where the information
is kept. If the information is to be used to select particular datasets
for examination, processing, or analysis, then that information must be kept
in a separate location from the data themselves. On the other hand, quality
parameters that are associated with the data as an integral part of any
delivered or distributed data product are most efficiently kept within
the dataset itself. (It is theoretically possible to incorporate quality
information from outside as a step in data distribution, but this would be
complex and might have major ramifications for the operation of the DSDS.)
Recall that a dataset is equivalent to a directory containing files
corresponding to individual data, images, sequences, etc., and
that the dataset is the atomic unit at which the DSDS database is organized.
The typical full SOI dataset contains 60 to 1440 images (files), each of
which must be individually characterized as to data quality. It is therefore
not possible to maintain this information in the relational database as
currently organized; conceivably a field containing a maximal set of status
and quality information words for all images in the dataset could be added to
each record. This note lays out an alternate plan, in which the availability
and quality information is kept in ancillary (flat) files, either within or
outside the individual datasets.
The Log Files
The various data information sources required may be organized as follows:
- SOHO Status Log
- a mission-long log providing information on the status of the SOHO
spacecraft at any time. Most likely maintained as an event/transition list.
- Instrument Status Log
- a mission-long log providing information on the basic MDI instrument
status and configuration at any time. The log consists of a set of state
codes as a function of time, as described in the format section below. The
instrument status is maintained in an appendable dataset, beginning with
mission day 1065 (1995.12.02_00:00:00_TAI, start of day of SOHO launch).
- Observing Logs (3)
- three mission-long logs that record the IP data products available for
each minute at various levels of processing: raw, level 0, and level 1.
- Data Quality Logs (several)
- individual mission-long data quality logs for various specific
observables containing appropriate status and quality flag values. Separate
data quality logs organized as time series with appropriate cadences should
be provided for at least:
- Full-Disc Dopplergrams (1 minute)
- Full-Disc Photograms (1 minute)
- Full-Disc Line-Depth grams (1 minute)
- Full-Disc Magnetograms (96 minute)
- Limb Photograms (12 minute)
- Medium-l Dopplergrams (1 minute)
- 128*128 Photograms (12 minute)
- 128*128 Line-Depth grams (12 minute)
- LOI-mask Dopplergrams (1 minute)
- LOI-mask Photograms (1 minute)
- Summary Quality Files
- Associated with and eventually contained in each dataset there should be
one or more files providing a combination of status and quality information
reflecting the contents of the appropriate logs, on a per-image basis.
File Locations, Names
The SOHO Status Log does not exist at this time.
The Instrument Status Log files are in the directory
/soidata/info/mdi_log/lev0/MDI_log_01d, a link to space
under DSDS control. Individual daily files are named XXXXXX.record.rdb,
where XXXXXX is the mission day number. For each daily file there is
an associated file XXXXXX.overview.fits. The directory contains
one additional file, history.txt, an ASCII text history of updates to the
logs.
All Observing and Data Quality Logs reside on the virtual directory
/home/soi/logs. (This is currently a link to the physical
directory /soidata/files/logs.)
The observing logs are named:
- obslog0
- the per-minute log of available level 0 data products, organized by
Reference Time (T_REF).
- obslog1
- the per-minute log of available level 1 data products, organized by
Observation Time (T_OBS).
The raw logs remain to be specified. A sample of obslog0 covering
a period of 454d 13h 36m starting Jan. 1, 1996 exists on
/scr30/logs/96log.
The Data Quality Logs are named:
- qual_fdV
- data quality parameters for full-disc Dopplergrams, indexed by mission
minute.
- qual_fdIc
- data quality parameters for full-disc Continuum photograms, indexed by
mission minute.
- qual_fdLd
- data quality parameters for full-disc Line Depth grams, indexed by mission
minute.
- qual_fdM
- data quality parameters for full-disc Magnetograms, indexed by mission
minute.
- qual_limb
- data quality parameters for level 1 Limb Continuum data, indexed by
12-minute intervals.
- qual_rwIc
- data quality parameters for 128*128 binned Continuum photograms, indexed by
12-minute intervals.
- qual_rwLd
- data quality parameters for 128*128 binned Line Depth grams, indexed by
12-minute intervals.
File names and organization for other observables remain to be specified.
The Summary Quality Files are included in Level 1.4/5 (?) datasets as
described in the man page for the module gather_qual.
Path names, structure definitions, constants, and function prototypes
specific to writing and reading, maintenance and interpretation of the Data
Quality Logs are contained in the include file
/CM/include/data_qual.h. Required functions are in the
library /CM/lib/_machtype/libMDI.a.
Format
The Instrument Status Log is a collection of RDB files each containing
7 fields and 1440 records, one record per minute; each record is 93 bytes long.
(Exception: the file for mission day 1066 contains 8 fields and each record
is 113 bytes long.) The fields are:
- I_DREC : the mission minute number of the record
- DATAFILE : always blank
- T_REC : date_time string corresponding to the mission minute
- T_OBS : always blank
- T_REF : same as T_REC
- MDI_SW : a 10-character string containing the hexadecimal
representation of the 32-bit MDI Status Word
- MDI_SSW : a 10-character string containing the hexadecimal
representation of the 32-bit MDI Status Status Word
- NOTE : blank for every record except 1535831, which
contains the string "MDI power turned on"; this field is usually absent.
(Apart from the entries in the history of updates, one day's worth of
instrument status consumes 134106 bytes for the RDB file plus 2880 bytes
for the FITS file, or 136986 bytes; this is nearly 12 times what a single
direct-access binary file containing just the Status Word and the Status
Status Word, with the record number implicit in the file position, would
consume for a day!)
The MDI Status Word and MDI Status Status Word are described in
Tech Note 96-135.
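As an illustration, here is a minimal sketch (not part of libMDI) of
converting the MDI_SW and MDI_SSW fields of a daily record.rdb file into
32-bit status words. The assumption that the 10-character field is a "0x"
prefix followed by 8 hexadecimal digits is inferred from the field width
quoted above.

#include <stdlib.h>

/* Convert a 10-character hexadecimal field (assumed to be "0x" plus 8
   hex digits) into a 32-bit status word; strtoul with base 16 accepts
   the leading "0x". */
static unsigned int parse_status_word (const char *field)
{
  return (unsigned int) strtoul (field, (char **) NULL, 16);
}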
The level-0 and level-1 Observing Logs are both binary files with 24-word
"records", each record corresponding to one minute of time starting with
1996.01.01_00:00:00_TAI. Each record consists of a set of 4-byte words
corresponding to all the DPC's whose Reference Times (for Level 0 data)
or calculated Observing Times (for Level 1 data) fall within that minute.
The records are padded with nulls so that the first null signifies the
end of known data products for that minute. The order of entries within
a record is immaterial. When multiple instances of a data product with
the same code occur in the same minute, as may occur during high-cadence
observing, only a single entry is made in the log. Separate higher-resolution
logs may be produced for such data products on a case-by-case basis.
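As a sketch of how such a log might be read (this is not the libMDI
implementation), the following routine seeks to the record for a given
mission minute, counted from the 1996.01.01_00:00:00_TAI epoch, and returns
the number of data product codes found. It assumes a 4-byte unsigned int
and ignores byte-order issues.

#include <stdio.h>

#define OBSLOG_WORDS_PER_REC 24       /* 4-byte DPC words per minute record */

/* Read the observing-log record for the given minute (minutes elapsed since
   1996.01.01_00:00:00_TAI).  Returns the number of DPC's present, or -1 on
   error; the first null word terminates the list. */
int read_obslog_record (const char *path, long minute,
    unsigned int dpc[OBSLOG_WORDS_PER_REC])
{
  FILE *fp = fopen (path, "rb");
  int count = 0;
  if (!fp) return -1;
  if (fseek (fp, minute * OBSLOG_WORDS_PER_REC * 4L, SEEK_SET) ||
      fread (dpc, 4, OBSLOG_WORDS_PER_REC, fp) != OBSLOG_WORDS_PER_REC) {
    fclose (fp);
    return -1;
  }
  fclose (fp);
  while (count < OBSLOG_WORDS_PER_REC && dpc[count]) count++;
  return count;
}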
The per-observable Quality Logs are binary files with 2-word "records",
each record corresponding to one observing interval appropriate to the
observable. The first word of a record is the quality word, and the second
word is the quality status word, detailing the validity of the corresponding
bits in the quality word. The meaning of individual fields in both words is
described in the Data Quality Parameters section below.
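A corresponding sketch for the quality logs (again, not the libMDI
implementation) reads the word pair for one observing interval, where index
is the interval number appropriate to the observable's cadence (e.g. the
mission minute for qual_fdV, the 12-minute interval number for qual_limb).
It assumes 4-byte words.

#include <stdio.h>

/* Read the 2-word record at the given interval index from a per-observable
   quality log.  Returns 0 on success, -1 on error. */
int read_quality_record (const char *path, long index,
    unsigned int *quality, unsigned int *status)
{
  FILE *fp = fopen (path, "rb");
  unsigned int rec[2];
  if (!fp) return -1;
  if (fseek (fp, index * 8L, SEEK_SET) || fread (rec, 4, 2, fp) != 2) {
    fclose (fp);
    return -1;
  }
  fclose (fp);
  *quality = rec[0];                  /* quality word */
  *status = rec[1];                   /* quality status word (validity bits) */
  return 0;
}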
Maintenance
The Instrument Status Log is updated by the module
add2mdilog.
The level-0 and level-1 Observing Logs are both populated and updated
by the module
add2obslog,
from per-record information available in the appropriate permanent
appendable data sets. They should be updated whenever the corresponding
level of processing is performed on any dataset. However, the module
is only capable of adding entries to the table. A separate routine may
be needed if it ever becomes necessary to withdraw an entry set by
erroneous processing. Such a case has never occurred to my knowledge.
Data Quality Parameters
The Data Quality Logs for each observable contain certain bits of
information common to all observables, plus additional bits
specific to particular data products. Common data quality parameters
are encoded in the following bits of the quality word (the bits are set
to indicate the truth of the statement):
- 00 Data calibrated (Level 1 data exist)
- 01 Data gathered (Level 1.5 data exist)
- 02 Fraction of expected data values valid (see below)
- 03 Fraction of expected data values valid (ditto)
- 04 Fraction of expected data values valid (ditto)
- 05 Level 1 statistics valid
- 06 Instrument configuration ``nominal'' for data product
- 07 Some valid pixels marked statistically unlikely
- 08 > x% of valid pixels marked statistically unlikely
(x ~ 0.01)
- 09 Framelist error - wrong observable
- 10 - 15 Reserved for future use
The corresponding bits in the quality status word are:
- 00 Existence of Level 1 data checked
- 01 Existence of Level 1.5 data checked
- 02 Fraction of expected values valid determined
- 03, 04 Unused
- 05 Level 1 statistics validated
- 06 Instrument configuration determined
- 07 Level 1 statistical quality determined
- 08 Unused
- 09 Framelist checked
- 10 - 15 Reserved for future use
In general, an "ideal" observation would have most of the bit fields set,
but this is not true for all of the fields, notably 7 - 9. A few notes of
explanation:
- Bits 2 - 4 form a numeric pattern corresponding to the fraction of
expected data values valid (see the bit-extraction sketch following these
notes):
- 0 - No data values valid
- 1 - < 10% of data values valid
- 2 - < 25% of data values valid
- 3 - < 50% of data values valid
- 4 - < 75% of data values valid
- 5 - < 90% of data values valid
- 6 - < 100% of data values valid
- 7 - All data values valid
- A number of level 1 data products have incorrect
statistics in their headers, due to processing errors presumably associated
with scaling. Although these can and should be fixed by correcting the bugs
and reprocessing, until that is done the header values cannot be reliably used
for examining image statistics. Information about the statistics validity
is included in bit 5.
- There is a non-Gaussian tail to the distribution of values for certain
observables, possibly due to cosmic-ray hits; this tail typically affects
about 0.005% of the pixels in full-disc Dopplergrams. Information about
whether this tail has been examined and whether individual pixels have been
marked statistically unlikely in separate files is included in bits 7 & 8.
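As a minimal sketch of testing the common fields, the fragment below extracts
the valid-fraction code from bits 2 - 4, first checking the corresponding bit
of the quality status word. The bit positions are those listed above, but the
macro names are illustrative only; the actual constants are defined in
data_qual.h.

/* Illustrative masks for the common quality-word fields (hypothetical names;
   bit positions follow the assignments listed above). */
#define QUAL_CALIBRATED    (1u << 0)    /* Level 1 data exist */
#define QUAL_GATHERED      (1u << 1)    /* Level 1.5 data exist */
#define QUAL_FRAC_SHIFT    2            /* bits 2 - 4: fraction code 0 - 7 */
#define QUAL_FRAC_MASK     (7u << QUAL_FRAC_SHIFT)
#define QUAL_STATS_VALID   (1u << 5)
#define STAT_FRAC_CHECKED  (1u << 2)    /* bit 2 of the quality status word */

/* Return the 0 - 7 valid-fraction code, or -1 if the fraction has not yet
   been determined according to the quality status word. */
static int valid_fraction_code (unsigned int quality, unsigned int status)
{
  if (!(status & STAT_FRAC_CHECKED)) return -1;
  return (int) ((quality & QUAL_FRAC_MASK) >> QUAL_FRAC_SHIFT);
}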
Data quality parameters specific to individual data products should include:
- Data origin, e.g.:
- Data from 60-sec "standard" Data Product
- Data from 30-sec "standard" Data Product in upper half of minute
- Data from 60-sec nonstandard Data Product
- Data from 30-sec nonstandard Data Product in upper half of minute
- Data from 30-sec Data Product in lower half of minute
- Data recomputed on ground from higher-resolution Data Product
- Other observable-specific quality parameters TBD
Bit fields specifically appropriate to the full-disc fdV, fdIc, and fdLd
observables are as follows (the corresponding bits in the status word
should be clear):
- 16 Data from full 60-second integration
Library Function Specifications
The following or similar functions are suggested for manipulation of the
observing and quality logs. Those marked with an asterisk have actually
been implemented as functions in libMDI.
- #include <data_qual.h>
- unsigned int *get_data_products (TIME t, int level)
- returns a malloc'd list of all available data products at the given
processing level within the mission minute including the selected time.
- int data_product_exists (unsigned int dpc, TIME t, int level)*
- returns true if the selected data product is available at the given
processing level within the mission minute including the selected time.
- int data_product_exists_near
(unsigned int dpc, TIME t, double delta, int level)*
- returns true if the selected data product is available at the given
processing level within the selected range of the selected time.
- int data_exist (int datatype, TIME t, int level)*
- returns true if any data products corresponding to the selected type
are available at the given processing level within the mission minute
including the selected time.
- int update_data_quality (int observable, int index, unsigned int quality,
unsigned int status, int replace, int timeout)*
- updates or replaces the quality and status words at the indexed location
for the corresponding observable. (A timeout is included in case of file
locking problems.)
- unsigned int data_quality_flag (int datatype, TIME t)
- returns a 32-bit flag value that can be masked with various constants
to determine the processing status and data quality for data of the
selected type whose observed interval includes the selected time.
- unsigned int MDI_configuration (TIME t)
- returns a 32-bit flag value that can be masked with various constants
to determine the MDI observing configuration and state at the selected
observing time.
- int clear_data_quality (int datatype, TIME t, int param)
- sets the appropriate bit(s) for the selected parameter to indicate that
the parameter is not valid (param_invalid) within the data quality flag for
the given data product during the mission minute including the selected time.
- int set_data_quality (int datatype, TIME t, int param, int value)
- sets the appropriate bit(s) for the selected parameter to the selected
value (and to param_valid) within the data quality flag for the given
data product during the mission minute including the selected time.
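A hypothetical usage sketch follows. The prototypes are those listed above;
the data product code, data type code, and the 300-second window are
placeholders, and <data_qual.h> is assumed to supply the TIME type.

#include <data_qual.h>

/* Decide whether a particular data product is worth retrieving: check for
   any data of the type at level 1 in the minute containing t, then for the
   specific DPC at level 0 within 300 seconds of t, then for the DPC itself
   at level 1 in the minute containing t. */
int product_available (unsigned int dpc, int datatype, TIME t)
{
  if (!data_exist (datatype, t, 1)) return 0;
  if (!data_product_exists_near (dpc, t, 300.0, 0)) return 0;
  return data_product_exists (dpc, t, 1);
}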
Additional Library Functions
The following could be implemented as macros testing the data quality flag.
Those marked with an asterisk have actually been implemented as functions
in libMDI. There are several other sample functions included there
that are not described here. Some of these functions are already in use by
production modules.
- #include <data_qual.h>
- stats_valid (int datatype, int level, int minute)*
- returns STATS_VALID if the statistics are known valid,
STATS_INVALID if the statistics are known invalid, and
STATS_NOT_VALIDATED if the header statistics have not been verified.
- excess_badpix (int datatype, int level, int minute)*
- returns BADPIX_EXCESSIVE if the number of statistically unlikely
values exceeds the maximum number tabulated per image,
BADPIX_NOT_CHECKED if the bad pixels have not been tabulated, and
the number of bad pixels otherwise.
- OKforFDDop (int minute)*
- returns 0 if the statistics are known invalid or if there is a framelist
error for the Full-Disc Dopplergram for the minute; 1 otherwise. (It would
be better to have this function return 1 only if the data were known to
be acceptable, and to have another function that would return 1 if the data
were known to be unacceptable.)
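As a sketch of the macro approach suggested above, stats_valid could be
expressed as a test on the quality word and quality status word for the
interval. The mask names below are illustrative; STATS_VALID, STATS_INVALID,
and STATS_NOT_VALIDATED are the return values already defined for the
function.

/* Hypothetical masks: bit 5 of the quality word ("Level 1 statistics valid")
   and bit 5 of the quality status word ("Level 1 statistics validated"). */
#define QW_STATS_VALID    (1u << 5)
#define SW_STATS_CHECKED  (1u << 5)

/* Macro form of stats_valid, given the two words for the interval. */
#define STATS_STATE(qual, stat) \
  (!((stat) & SW_STATS_CHECKED) ? STATS_NOT_VALIDATED : \
   ((qual) & QW_STATS_VALID) ? STATS_VALID : STATS_INVALID)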
Querying the Logs
Appendix I: Notes
I have a mission observing log (on /scr30/logs/96log at the moment).
It is a binary file with 24-word "records", each record corresponding
to one minute of time starting with 1996.01.01_00:00. The contents of
each record are a set of 4-byte words corresponding to all the DPC's
with (calculated) OBS_TIME's (not REF_TIME's) within that minute,
plus padding nulls. (I don't think we ever have more than 24 DPC's
in a minute.) It's obviously easy to seek into this file to come up
with a list of DPC's observed during (or within a specified distance from)
a particular minute, or to scan through it to produce a set of observing
times corresponding to any collection of DPC's. I have a few very simple
functions written, like
unsigned int *get_data_products (TIME t)
int data_product_exists (unsigned int dpc, TIME t)
int data_exist (int datatype, TIME t)
(see ~rick/soi/qual/continuous.c)
Currently the log is populated with a selected set of DPC's corresponding
to all the standard 5k data products and all the standard full-disk data
products. It can be (re-)populated for any given DPC with a module
(~rick/soi/qual/fill_obslog.c) that can take either a dataset name or the
name of the record.rdb file in the dataset as arguments, so it can be used
quickly from the files on /soidata/info/mdi_rec. The log is populated from
the level 0 headers: it is intended to be an original source of information
about DPC's. How far the corresponding data have been processed should be
provided as part of a quality flag in a separate set of files. Ideally the
observing log would be populated in the dosciXk processing scripts immediately
ahead of the cpinfo calls. There is no harm in rerunning the module with the
same input dataset multiple times, and the order in which the DPC's are added
is immaterial.
I am leaning toward another collection of binary files (again organized on a
one-per-minute or other appropriate cadence basis), one for each major observable
(e.g. full-disk Doppler, Limb continuum). For each minute there would be a
single word (of length TBD, but 32 bits is probably sufficient) describing
the level of processing and various statistical data quality parameters, plus
corresponding validity bits. This would be like the general quality flag that
Rock and I developed, but whereas that one (which I believe is being
implemented) focuses on instrument state, this one would use parameters
appropriate to individual observables. It would likely be predominantly
filled at or immediately after the level 1 processing, but with some filling
at both level 0 and gather, and maybe even later. Here is a sketch of a
pattern for say FDV, FDC, & FDL:
0: 60-sec "standard" DPC exists
1: 30-sec "standard" DPC exists in upper half of minute
2: 60-sec nonstandard DPC exists
3: 30-sec nonstandard DPC exists in upper half of minute
4: 30-sec DPC exists in lower half of minute
5: data calibrated
6: level 1 statistics validated
7: > 0 values valid
8: > 10% valid
9: > 50% valid
a: > 90% valid
b: no missing values
c: A < min < B (x sigma in chi-squared, adjusted for expectation)
d: C < max < D (")
e: E < mean < F (")
f: instrument configuration "nominal"
10: instrument configuration inspected
11: data gathered to level 1.5
12: level 1.5 data reassembled from higher-resolution data
...
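If this pattern were adopted, it could be encoded as a set of bit masks along
the following lines (interpreting the indices above as hexadecimal bit
positions; the names are illustrative only):

enum fd_quality_bits {
  FDQ_STD_60S        = 1 << 0,    /* 60-sec "standard" DPC exists */
  FDQ_STD_30S_UPPER  = 1 << 1,    /* 30-sec "standard" DPC, upper half */
  FDQ_NONSTD_60S     = 1 << 2,    /* 60-sec nonstandard DPC exists */
  FDQ_NONSTD_30S_UP  = 1 << 3,    /* 30-sec nonstandard DPC, upper half */
  FDQ_30S_LOWER      = 1 << 4,    /* 30-sec DPC, lower half of minute */
  FDQ_CALIBRATED     = 1 << 5,
  FDQ_STATS_VALID    = 1 << 6,
  FDQ_SOME_VALID     = 1 << 7,    /* > 0 values valid */
  FDQ_GT_10PCT_VALID = 1 << 8,
  FDQ_GT_50PCT_VALID = 1 << 9,
  FDQ_GT_90PCT_VALID = 1 << 10,
  FDQ_NO_MISSING     = 1 << 11,
  FDQ_MIN_IN_RANGE   = 1 << 12,   /* A < min < B */
  FDQ_MAX_IN_RANGE   = 1 << 13,   /* C < max < D */
  FDQ_MEAN_IN_RANGE  = 1 << 14,   /* E < mean < F */
  FDQ_CONFIG_NOMINAL = 1 << 15,
  FDQ_CONFIG_CHECKED = 1 << 16,
  FDQ_GATHERED_1P5   = 1 << 17,   /* gathered to level 1.5 */
  FDQ_REASSEMBLED    = 1 << 18    /* reassembled from higher-resolution data */
};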