Archiving Data from the Taiwan Oscillations Network

SOI TN 98-140
R S Bogart & D Y Chou
1998.12.08

Introduction

This document provides a design and interface specification for the archiving and processing of full-disc, high-resolution Ca-K line intensity data from the Taiwan Oscillations Network (TON). The aim of the project is to provide an ongoing publicly-accessible archive of original and processed data from the network.

Data Flow

The overall data flow is summarized in the following steps:

Production of raw data tapes in original format: this will continue to be carried out as at present by the TON staff, with the raw data permanently archived at Tsing Hua University on exabyte tapes. Copies of these tapes will be produced by the TON staff and mailed to the Stanford Helioseismology Archive as they become available. Upon receipt they will be archived at Stanford as well, from where the data will be publicly distributed.
Creation of Level 0 data: the raw data will be converted to FITS format at Stanford and permanently archived.
Calibration: the Level 0 data will be calibrated and archived in FITS format at Stanford, following procedures designed jointly by the scientists and programmers at Tsinghua University, Stanford and other TON member sites.
Merging: the Level 1 data from the different TON sites may eventually be merged into a single data set, following procedures to be designed by the TON team.
Helioseismology processing: the Level 1 data will be available for processing through the same pipeline modules in use at Stanford for processing MDI data to such products as spherical-harmonic amplitudes, mode frequencies, ring-diagram and time-distance data sets, and archived at Stanford. Details of this processing are to be negotiated and specified in future.

Data flow notes

The exabyte tapes are written at an unknown block size, TBD. Using a variable block length to read them at Stanford requires about 20 seconds per file, or about 100 kB/sec. Keeping pace with raw data input at 100% coverage would require 8 hours per day of tape read time at this rate.

Estimated data volume

The TON observations began ???. They currently proceed at four sites with an average of ??? site-days per year. Daily observations at each site typically last ??? hours, producing up to 60 1080*1080 pixel photograms per hour. The photograms are 16 bits deep. The calibrated (level 1) images are 1051*1051. The average coverage for the entire network is ??%. The overall data volume is thus:

Raw: ??? MB / site-day (dataset); ??? GB / day; ??? TB / yr; ??? TB to date
Level 0: ??? MB / site-day (dataset); ??? GB / day; ??? TB / yr; ??? TB to date
Level 1: ??? MB / site-day (dataset); ??? GB / day; ??? TB / yr; ??? TB to date
Level 1 Merged: ??? MB / hour (dataset); ??? GB / day; ??? TB / yr; ??? TB to date

Data Product Descriptions

Raw data

The raw data tapes each contain data from a particular site for one to a few days. Each tape contains several tar files, one per day of observations at the site. The tar file includes a single directory containing a set of identically formatted image files, one per image (minute). The directories are named by the day of observation (e.g. 960801) and the files by the day and UT minute number of the observation (e.g. 960801.433, 960801.1129). (Observations from sites spanning multiple UT days within the observing day are split onto separate directories; those from the same UT day but different observing days are combined in the same directory.) IS THIS TRUE? The raw file format is a 1028 byte header followed by 1080*1080 2-byte unsigned little-endian unsigned shorts representing the data values. The file is null-padded with 892 bytes to a total of 2334720 bytes.

The header structure is a 1024 byte ASCII string followed by two 2-byte little-endian short integers representing the number of columns and rows in the data; these should always both be 1080 (0x0438). The ASCII string consists of the following:

A 25-character null-terminated identification string, e.g.
260_016:40:00W_028:18:00N
A null-terminated string of the following format:
N No. 1, 617 GMT1996/08/01_18:43:55670 O(538,542) M( 2, -2) B= 4736 R=496 T=226 E=0 size=1080,1080 expousre=1500 ms
A newline character
A null-terminated string of the following format:
; O(537.6,541.7) B= 4736 R=496.0 E= 0 Sx=0.01011 Sy=0.01301 Av= 5197.9 No= 1

The archived Raw data sets will be organized under dataset names prog:ton,level:raw,series:site[day-number], with an epoch for the day number early enough to assure non-negative values in the archive; 1993.01.01_00:00:00_TAI, the MDI epoch, will suffice. The site name can take on one of the following values (more may be added):

bb : Big Bear Observatory
hr : Huairou Observatory
tf : Tenerife
ub : Uzbekistan

Level 0 data

The Level 0 data consist of FITS files organized in directories containing all images observed within a given UT day at a given site. These will include all filtergrams, darks, and calibration (diffuser) images. The file naming convention is YYMMDD.mmmm.fits, where the file name reflects the nominal observation time (UT minute) of the image. In addition to the required FITS keywords SIMPLE, BITPIX, NAXIS, and NAXISn, the data should contain the following keywords:

T_OBS: observation time (string), in format YYYY.MM.DD_hh:mm:ss_TYP, where TYP is UT, UTC, TAI, PST, or PDT.
TYPE: type of filtergram (string); valid values are "obs", "dark", or "cal".
ORIENT: image orientation (string); "SESW" if image is direct, "NENW" if inverted

The Level 0 data are 1080*1080 uncropped images. They should provide a primary data source from which any modifications to the analysis procedure (e.g. changes in the registration and calibration algorithms) can proceed; the Raw data effectively supply a backup archive.

The archived Level 0 data sets will be organized under dataset names prog:ton,level:lev0,series:site[day-number], with the same epoch for the day number as for the raw data. The data directories will also require overview.fits and record.rdb files with the header info to make them "conforming" data sets.

Level 1 data

One Level 1 data product is planned, calibrated intensity in the same FITS format and directory arrangement as the Level 0 data.

The calibrated photograms will be produced by techniques TBD. Calibration will involve flat-fielding, normalization of the intensity values, and registration of the images to a fixed location on a 1051*1051 pixel grid. The following keywords must be supplied in the Level 1 data (for detailed discussion of their meaning, see Technical Note 95.122):

DSNAME:
PROTOCOL:
CONFORMS:
T_BLOCK:
T_START:
T_STOP:
T_EPOCH:
DATAFILE:
BLDVER10:
SOURCE:
I_DREC:
T_REC:
T_OBS:
INTERVAL:
DATASIGN:
S_MAJOR:
S_MINOR:
S_ANGLE:
ORIENT:
IM_SCALE:
XSCALE:
YSCALE:
XCEN:
YCEN:
X0:
Y0:
DATAMIN:
DATAMAX:
DATAMEAN:
DATA_RMS:
DATASKEW:
DATAKURT:
DATAMEDN:
DATAVALS:
MISSVALS:
OBS_B0:
OBS_L0:
OBS_DIST:
OBS_VR:
OBS_VW:
OBS_VN:
OBS_R0:
ORIGIN:
TELESCOP:
INSTRUME:
DATE_OBS:
SOLAR_P0:
R_SUN:

Additional keywords may be required.

The archived Level 1 data sets will be organized under dataset names prog:ton,level:lev1,series:site[day-number], with the same epoch for the day number as for the Level 0 data. The data sets will be conforming FITS_RDB, and TS_EQ (image files numbered corresponding to the minute of the hour and mainly blank records in the RDB file corresponding to minutes for which no data image exists). Individual files will be named ???.

Level 1 Merged data

TBD

Level 2 data products

The Level 2 data products to be produced should be in the same format and organization as the corresponding products generated from MDI full-disc data, and should include at a minimum spherical harmonic mode amplitudes, which would be archived under dataset names prog:ton,level:lev2_shc,series:V_l0-l1_01d[day-number].

This page last revised Thursday, 09-Aug-2001 15:57:35 PDT