Notes on Processing TON Data

SOI TN 01-145
R S Bogart
2001.08.09

Introduction

This note documents the current status of efforts toward acquisition and processing of the Taiwan Oscillations Network (TON) data through the SOI Science Support Center (SSSC). The aim is to provide background for the design and development of procedures for a production processing pipeline for the data, at least through Level 1 (calibration) and eventually through data merging and analysis. The general aims and responsibilities of the project are described in SOI Technical Note 98-140.

The data files and programs referred to in this note are currently scattered in a variety of personal and scratch-space directories, principally ~rick/ton and /surge2/ton/. A considerable amount of raw data and Level-0 data have already been ingested into the DSDS under prog:ton as described in TN 98-140.

Background

The nominal form of the TON raw data is as described in TN 98-140. Because of repeated and frequent problems with Exabyte tape reads, the TON project has made available files containing the headers of all images on each tape, to be used for verification of the tape extraction. These files are on the TON anonymous ftp site at
140.114.80.234 = astro4.phys.nthu.edu.tw
under the directories pub/ton-header/XX, where XX is one of the four standard TON site codes: bb, hr, tf, and ub. The most recent update to these header files at the ftp site was Dec. 11, 2000. The entire contents of these directories were copied to /surge2/ton/tape_lists on Dec. 19, 2000.

The general outline of the required data processing steps is:

  1. Read the data from the Exabyte tar tapes as they are sent
  2. Compare the tape contents with the appropriate verification header files - obtain the verification file and fix as necessary
  3. Ingest the validated raw data
  4. Convert the raw data to Level-0 (FITS files) and ingest
  5. Obtain, interpret, and ingest the required calibration data - these are also available on the TON ftp site
  6. Calibrate the Level-0 data and ingest the corresponding Level-1 data
  7. Merge the Level-1 data into a Level 1.5 data stream and ingest
These steps are each discussed in detail below.

Reading the Tapes

[Hao, Jeneen: describe tar scripts, locations of scratch directories, space and time requirements, machines...]

Verification from the Header Files

In principle, it should be straightforward to make a list of the expected images and their key characteristics from the header files. I have a module tonexpect that will report the times of the first and last images of each day, the corresponding sequence numbers, and the numbers of images per day and on the tape. Sample output from a run:
}> tonexpect file= /surge2/ton/tape_lists/bb/bb008.dat
summary of /surge2/ton/tape_lists/bb/bb008.dat:
1994.07.10 14:43:56 ( 883) - 23:59:56 (1439) 339
1994.07.11 00:00:56 ( 0) - 23:59:56 (1439) 690
1994.07.12 00:00:56 ( 0) - 23:59:56 (1439) 702
1994.07.13 00:00:56 ( 0) - 01:53:56 ( 113) 115
1846 images total
(Note that depending on the longitude of the site, the roughly contiguous data from daylight hours may span two UT days; we have decided to organize all the data by UT day corresponding approximately to SOI mission day. A typical Big Bear data set will thus contain 8-12 hours of data in two separate blocks, one at the beginning of the day and the other at the end.)

The format of the headers is not uniform; there are several different variants, described in the section on Level-0 processing below, and the module must be taught to recognize the different variants as they come up. There may still be cases of unknown formats, especially among the earlier data.

The images are normally time-ordered in the tape header files, although occasionally the headers from groups of days are replicated. The program, of course, detects such replication. For example:

}> tonexpect file= /surge2/ton/tape_lists/tf/tf129.dat
summary of /surge2/ton/tape_lists/tf/tf129.dat:
1995.05.05 16:15:56 ( 975) - 19:25:56 (1165) 191
1995.05.06 06:53:56 ( 413) - 16:26:56 ( 986) 574
1995.05.07 06:28:56 ( 388) - 19:32:56 (1172) 785
1995.05.08 07:34:56 ( 454) - 19:25:56 (1165) 558
1995.05.06 06:53:56 ( 413) - 16:26:56 ( 986) 574
1995.05.07 06:28:56 ( 388) - 19:32:56 (1172) 785
1995.05.08 07:34:56 ( 454) - 13:39:56 ( 819) 367
3834 images total
(There are also cases in which the entries are clearly spurious: for example, ub031.dat starts with 12 headers dated 1970.01.01 before the next ones dated 1996.11.24.) The files can be suitably edited to remove the duplicate entries; there are about a dozen such cases I have noted so far among the 690 verification files on line. Of more concern is the fact that there are various errors in the files that cause the interpreting program to fail:
}> tonexpect file= /surge2/ton/tape_lists/hr/hr135.dat
summary of /surge2/ton/tape_lists/hr/hr135.dat:
1996.07.03 23:33:56 (1413) - 23:59:56 (1439) 28
1996.07.04 00:00:56 ( 0) - 23:59:56 (1439) 518
1996.07.06 00:00:56 ( 0) -
unexpected string @ line 2103:
; O(551.2,734.6) B=18552 R=493.8 E= 9 Sx=0.01217 Sy=0.01627 Av=17650.2 No=1052

08:23:56 ( 503) 505
1051+ images total
The typical problem is that the newline character was dropped from the header, often along with some or all of the digits of the preceding sequence number (which can be inferred). The files can be edited and the interpreting program rerun until all such errors are removed. I have placed copies of the files to be edited on /surge2/ton/fixed_lists. Thus,
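Locating such damaged lines in bulk could be sketched roughly as follows. The two line patterns here are hypothetical stand-ins (the real verification-file grammar is richer than this), so they would need to be adjusted to the actual header formats:

```python
import re

# Hypothetical patterns standing in for the two kinds of lines seen in the
# tonexpect transcripts above: a header line beginning with a YYMMDD.M-style
# field, and a "; O(...)" statistics line.  Assumptions, not the real grammar.
HEADER = re.compile(r'^\d{6}\.\d+')
STATS = re.compile(r'^; O\(')

def suspect_lines(path):
    """Return 1-based numbers of non-blank lines matching neither assumed
    pattern, i.e. candidates for the dropped-newline damage described above."""
    bad = []
    with open(path) as f:
        for n, line in enumerate(f, 1):
            if line.strip() and not (HEADER.match(line) or STATS.match(line)):
                bad.append(n)
    return bad
```

The flagged line numbers can then be inspected and repaired by hand, as is done now.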
}> tonexpect file= /surge2/ton/tape_lists/ub/ub014.dat
summary of /surge2/ton/tape_lists/ub/ub014.dat:
1996.08.10 02:24:56 ( 144) - 13:23:56 ( 803) 660
1996.08.11 02:36:56 ( 156) - 13:23:56 ( 803) 648
1996.08.12 02:21:56 ( 141) - 13:21:56 ( 801) 656
1996.08.13 01:55:56 ( 115) -
unexpected string @ line 4945:
; O(540.0,540.0) B= 0 R= 0.0 E= 2 Sx=0.00687 Sy=0.00818 Av= 980.4 No=2474

10:21:56 ( 621) 508
2472+ images total
}> tonexpect file= /surge2/ton/fixed_lists/ub/ub014.dat
summary of /surge2/ton/fixed_lists/ub/ub014.dat:
1996.08.10 02:24:56 ( 144) - 13:23:56 ( 803) 660
1996.08.11 02:36:56 ( 156) - 13:23:56 ( 803) 648
1996.08.12 02:21:56 ( 141) - 13:21:56 ( 801) 656
1996.08.13 01:55:56 ( 115) - 10:43:56 ( 643) 530
2494 images total
There is a log of the edits made on ~rick/ton/fixlog and summary output for all verification files of each site XX on ~rick/ton/sum.XX - these are made by the script mksum. A run of this script for a particular site will produce a list of the files that still need to be edited and a summary of the number of `good' and `bad' files:
}> mksum hr
hr079.dat
hr093.dat
hr : 76 good, 2 bad
Once the files are `fixed', the tonexpect module could be modified to write out more detailed information per expected image (to a file) if that is useful for processing.
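The classification that mksum performs can be sketched as follows, assuming (as the transcripts above suggest) that a failed tonexpect run contains the string `unexpected string' in its output; the function name and interface are illustrative only:

```python
def classify_site(site, outputs):
    """outputs: dict mapping verification-file name to the captured text of
    a tonexpect run on that file.  Assumption: failed runs contain the
    phrase 'unexpected string', as in the transcripts above."""
    bad = sorted(f for f, text in outputs.items() if 'unexpected string' in text)
    good = len(outputs) - len(bad)
    for f in bad:
        print(f)                # files still needing hand edits
    print('%s : %d good, %d bad' % (site, good, len(bad)))
    return bad
```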

Another problem occurs in certain verification files in which the headers for images from a given day are repeated without intervening days. Examples of such files are ub010 and ub081:

summary of /surge2/ton/fixed_lists/ub/ub010.dat:
1996.07.30 02:55:56 ( 175) - 11:19:56 ( 679) 480
1996.07.30 02:55:56 ( 175) - 11:18:56 ( 678) 480
960 images total
summary of /surge2/ton/fixed_lists/ub/ub081.dat:
1998.04.16 03:41:56 ( 221) - 12:10:56 ( 730) 502
1998.04.16 03:42:56 ( 222) - 12:50:56 ( 770) 541
1998.04.17 02:51:56 ( 171) - 09:16:56 ( 556) 386
1998.04.18 04:59:56 ( 299) - 10:40:56 ( 640) 339
1998.04.20 06:59:56 ( 419) - 11:58:56 ( 718) 300
1998.04.23 10:50:56 ( 650) - 12:24:56 ( 744) 95
1998.04.27 03:19:56 ( 199) - 13:01:56 ( 781) 583
1998.04.28 08:18:56 ( 498) - 11:27:56 ( 687) 191
2937 images total
In these cases the sequence numbering in the headers continues (normally they restart at 1 on a new day), so that the headers for images with the same time are not exactly identical. It is not clear if this is significant. There could in fact be duplicated images in these cases, but there probably aren't. We need to look at a case when we have a corresponding tar tape.
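Flagging candidate files of this kind from the per-day summary lines is straightforward; a minimal sketch follows (the tuple representation of the day blocks is an assumption for illustration, not tonexpect's actual output format):

```python
def repeated_days(blocks):
    """blocks: list of (date, first_seq) tuples, one per day-summary line
    in file order.  Returns indices of blocks whose date repeats the
    immediately preceding block's date, as in ub010 and ub081 above."""
    return [i for i in range(1, len(blocks))
            if blocks[i][0] == blocks[i - 1][0]]
```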

Occasionally individual image header times are repeated even though the images are apparently distinct based on the other parameters in the header. For example, there are two entries with reference time 1996/07/21_00:11:56020 in hr136.dat. There is only one actual image in the corresponding tar directory, and it corresponds to the second of the two headers in the verification file.

A few of the header files do not bear the names of the corresponding tapes: hreclipse.dat contains the headers from tape HR 163, for example, and hr148b.dat those from HR 159. ub009a.dat appears to contain only a subset of the contents of ub009.dat.

Expectation Logs

Running the tonexpect module with the parameter log=dir writes (or appends to) a set of files in the directory dir. These files are named with the number corresponding to the appropriate day number under which the data would be ingested. Each file contains a list of images, one per line, with the tape ID and the file name separated by a space. For example,
}> tonexpect file= /surge2/ton/fixed_lists/hr/hr135.dat log= /surge2/ton/verify/hr
would produce (or append to) files named 1279, 1280, 1281, 1282, 1283, 1288, and 1292 in the directory /surge2/ton/verify/hr.
}> cat /surge2/ton/verify/hr/1279
HR135 960703.1414
HR135 960703.1415
HR135 960703.1415
HR135 960703.1416
...
HR135 960703.1438
HR135 960703.1439
HR135 960703.1440
(Note the repeated entries corresponding to a pair of distinct image headers with the same reference time in the verification file!)
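The day-number naming can be sketched as follows. The epoch here is an inference, not a documented convention: the hr135 example above, in which 1996.07.03 appears as file 1279, implies day 0 = 1993.01.01.

```python
from datetime import date

EPOCH = date(1993, 1, 1)   # assumption: inferred from 1996.07.03 -> 1279

def soi_day(y, m, d):
    """Day number used to name the expectation-log files."""
    return (date(y, m, d) - EPOCH).days
```

With this epoch, soi_day(1996, 7, 3) indeed gives 1279.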

Raw Data Ingestion

Extracted data will reside in subdirectories labeled by the date in form YYMMDD; for example, the tar extract of tape bb192 contains 3 directories, 960806, 960807, and 960808. Each day's subdirectory contains only the raw data files for that day, named YYMMDD.M, where M is the 1-4 digit minute number. These directories, if verified as being complete, can simply be ingested as is into prog:ton,level:raw,series:SS,[D], where SS is the site code and D is the SOI day number of the corresponding date.
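Constructing the dataset specification for one extracted day directory might look like the following sketch; the SOI day-number epoch of 1993.01.01 is an assumption (consistent with the expectation-log example in which 1996.07.03 maps to day 1279), and the two-digit years are taken to be pre-2000, as all the data discussed here are:

```python
from datetime import date

def raw_dataset_spec(site, yymmdd):
    """Build the DSDS spec for one day's raw directory, e.g. 'bb', '960806'.
    Assumptions: SOI day 0 = 1993.01.01, and YY means 19YY."""
    y = 1900 + int(yymmdd[:2])
    m, d = int(yymmdd[2:4]), int(yymmdd[4:6])
    day = (date(y, m, d) - date(1993, 1, 1)).days
    return 'prog:ton,level:raw,series:%s,[%d]' % (site, day)
```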

Level 0 Processing

The module rdton converts the files in a TON raw data set to FITS format with appropriate keywords, and places them in a conforming output data set of type FITS/RDB. The raw data format consists of a mixed ASCII and binary integer header followed by a data section of 1080^2 2-byte unsigned big-endian integers representing the data in row-column order, i.e. transposed from normal FITS storage order. All pixel values are assumed valid, although the value of pixel [0,1079] in the southwest corner is typically bad. Occasionally there are incomplete images. The module should write out missing values as 0 and set the datatype to unsigned shorts, but this needs to be verified. At present it appears to be setting the datatype to scaled floats (possibly due to a bug in the FITS I/O routines), and the missing values are garbage.
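rdton itself is a C module; purely as a sketch of the data-section layout, under the assumptions noted in the comments (a square 1080 x 1080 image with the data section at the end of the file), the read could be expressed as:

```python
import numpy as np

def read_ton_data(path, n=1080):
    """Read the data section of a TON raw file: n*n big-endian unsigned
    16-bit integers in row-column order, i.e. transposed relative to FITS
    storage order.  Assumption: the data section occupies the last 2*n*n
    bytes of the file, after the mixed ASCII/binary header."""
    raw = open(path, 'rb').read()
    data = np.frombuffer(raw[-2 * n * n:], dtype='>u2')
    return data.reshape(n, n).T      # transpose into FITS order
```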

Types of Headers

The nominal file header is described in TN 98-140; however, there are at least three variant forms present. (Note the 3-character escape sequence at the beginning of the tape identifier of HR093 and the missing three characters at its end.) There are other headers that contain extra nulls, as described in rdton.c.

One problem with the existing module is that the serial numbers in the Level-0 data set are based on the minute number of the reference time, which is generally (perhaps always) one less than the minute number of the observing time, since the exposures are set to be centered on the minute tick and the clock is started about 4 seconds earlier. For example, a reference time of 00:11:56 corresponds to an exposure centered on 00:12:00, i.e. minute 12, not 11.

Calibration Data

Calibration images are available for various days (typically one or two per week per site) on the TON ftp site, under the directories pub/ton/ton-dc-ff/dc (dark current) and pub/ton/ton-dc-ff/ff (flat field). Dark current images for a given day YYMMDD and site SS are named ssYYMMDD.dc.pc, e.g. bb960607.dc.pc; corresponding flat-field images are named ssYYMMDD.ff.kxky6, e.g. bb960607.ff.kxky6. There are not at this time any calibration images for the Huairou site.

Calibration

Calibration to Level-1 data is simply a matter of dark-correcting and flat-fielding the Level-0 images on a pixel-by-pixel basis with the appropriate values from nearby dark-current and flat-field images. The formula for flat-fielding is:
 I1_ij = (I0_ij - DC_ij) / FF_ij
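A minimal numpy sketch of this per-pixel operation follows; the NaN guard for zero flat-field pixels is an addition here, not something the existing code is known to do:

```python
import numpy as np

def calibrate(I0, DC, FF):
    """Level-0 -> Level-1: dark-correct and flat-field pixel by pixel,
    I1 = (I0 - DC) / FF.  Zero flat-field pixels are mapped to NaN rather
    than dividing by zero (a guard assumed here, not taken from rdton)."""
    I0 = np.asarray(I0, dtype=float)
    DC = np.asarray(DC, dtype=float)
    FF = np.asarray(FF, dtype=float)
    FF = np.where(FF == 0, np.nan, FF)
    return (I0 - DC) / FF
```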

Site Merging


This page last revised Wednesday, 15-Aug-2001 15:08:13 PDT