The SEPEM reference proton dataset

Meaningful statistical analyses of SEP events require a contiguous, uniform, high-quality dataset that spans as many solar cycles as possible.

In practice, only two data sources can fulfil these criteria: the IMP8/GME data, and the SEM(-2) data from successive GOES spacecraft (an overview of the data sources that have been ingested in SEPEM is available separately). Even these datasets need substantial processing before they can be used in statistical analyses: spikes and other corrupted records have to be removed or corrected, and data gaps have to be filled where possible.

After correcting and completing the data, there still remains the issue that the energy bins in the respective instrument channels are different (even for the SEM instruments there are non-negligible differences), so the data cannot easily be combined. This requires additional processing of the data: re-binning into a common set of energy channels, and cross-calibration of the instruments. Only at this stage can the data be used for statistical analyses. The various processing steps are described in detail in the following sections. All processing (except the dataset merging, which was done manually) was performed by means of the tools available on the SEPEM server, and can be verified or repeated by any registered user.

The SEPEM reference proton dataset is intended to be used for analysing proton-induced radiation effects. The energy range covered by the dataset is 5–200 MeV, in ten logarithmically spaced channels, which can be interpreted as a virtual instrument dataset (the energy channels are listed below).

Data selection

The long duration datasets used to construct the SEPEM reference proton dataset have been collected by instruments on IMP8 and on the GOES spacecraft series.

For IMP8, data from the GME and CPME instruments have been analysed for suitability in the reference dataset construction. It turned out that the CPME instrument suffered from severe saturation (and possibly other) effects, and that in addition there were numerous gaps in the dataset. Therefore, it was decided not to use these data. The GME data also show saturation effects and data gaps, but to a less severe extent than the CPME data. Nevertheless, especially during the second half of the mission, a number of very large SEP events are insufficiently covered by the GME data. A detailed analysis of the data caveats was compiled for the CPME and GME datasets.

The GOES/SEM proton data do not suffer from saturation, and only have relatively small data gaps, which can be filled by interpolation or by using data from the secondary GOES spacecraft. A detailed description of the GOES/SEM proton datasets and caveats was also compiled.

In order to make maximum use of the strong points of the respective datasets, the following procedure was used to combine the data:

Data selection for the reference proton dataset
Original dataset | Original time span | Time span of the selected data | Comments
IMP8/GME | 01-11-1973–26-10-2001 | 01-11-1973–26-10-2001 | Primary dataset prior to 01-01-1984. Data were also used to cross-calibrate the GOES/SEM data. Channel 92.5–107.0 MeV was not used.
GOES05/SEM | 01-01-1984–24-03-1987 | 01-01-1984–05-03-1987 |
GOES07/SEM | 06-03-1987–12-08-1996 | 06-03-1987–28-02-1995 |
GOES08/SEM-2 | 01-03-1995–31-05-2003 | 01-03-1995–31-05-2003 |
GOES11/SEM-2 | 01-07-2000–28-02-2011 | 21-06-2003–28-02-2011 | Due to large data gaps, data prior to 21-06-2003 are only used for cross-calibration.
GOES12/SEM-2 | 01-01-2003–28-02-2010 | 01-06-2003–20-06-2003 | Used to bridge the three-week gap between GOES08 and GOES11. Channels P6 and P7 are missing; data values for this period were set to 0.0026 and 0.0012, respectively (the surrounding background level; GOES10 data confirm that these channels remain at background level during this period).
GOES13/EPS | 01-05-2010–30-06-2013 | 01-03-2011–31-03-2013 | The cross-calibration factors for GOES11 were applied as there is insufficient overlap.

As the energy range of the SEPEM proton reference dataset is limited to 5–200 MeV, not all energy channels of the GME, SEM and EPS instruments are required. The tables below list the channels which were retained for the construction of the reference dataset.

Energy channels used for the proton reference dataset. Energies are given in MeV.
Reference dataset
Channel name | Energy range
F1 | 5.00–7.23
F2 | 7.23–10.46
F3 | 10.46–15.12
F4 | 15.12–21.87
F5 | 21.87–31.62
F6 | 31.62–45.73
F7 | 45.73–66.13
F8 | 66.13–95.64
F9 | 95.64–138.3
F10 | 138.3–200.0
GOES05-07/SEM
Channel name | Energy range
P2 | 4.2–8.7
P3 | 8.7–14.5
P4 | 15.0–44.0
P5 | 39.0–82.0
P6 | 84.0–200.0
P7 | 110.0–500.0
GOES08-12/SEM-2
Channel name | Energy range
P2 | 4.0–9.0
P3 | 9.0–15.0
P4 | 15.0–40.0
P5 | 40.0–80.0
P6 | 80.0–165.0
P7 | 165.0–500.0
GOES13/EPS
Channel name | Energy range
P2 | 4.2–8.7
P3 | 8.7–14.5
P4 | 15.0–40.0
P5 | 38.0–82.0
P6 | 84.0–200.0
P7 | 110.0–900.0
IMP8/GME
Channel name | Energy range
DIntn_8 | 4.94–5.96
DIntn_9 | 5.96–7.25
DIntn_10 | 7.25–8.65
DIntn_11 | 8.65–11.10
DIntn_12 | 11.10–13.60
DIntn_13 | 13.60–16.10
DIntn_14 | 16.10–18.70
DIntn_15 | 18.70–22.50
DIntn_16 | 19.80–24.20
DIntn_17 | 24.20–28.70
DIntn_18 | 28.70–35.20
DIntn_19 | 35.20–42.90
DIntn_20 | 42.90–51.00
DIntn_21 | 51.00–63.20
DIntn_22 | 63.20–81.00
DIntn_23 | 87.00–92.50
DIntn_25 | 107.0–121.0
DIntn_26 | 121.0–154.0
DIntn_27 | 154.0–178.0
DIntn_28 | 178.0–230.0

Data cleaning and gap filling

The datasets used for constructing the proton reference dataset contain numerous spikes and other corrupted data records. In addition, the GME data during the largest SEP events show saturation effects. The corrupted data records have to be removed or corrected, as they would otherwise contaminate any statistical analysis.

As an illustration, the figure below shows the GOES07/SEM proton data during the Oct 89 event.


GOES07/SEM data for the Oct 89 event

The next figure shows the same event as seen in two channels of the GME data. The following defects are immediately obvious:


Two channels of the raw IMP8/GME data for the Oct 89 event

Upon closer examination, and comparison with the GOES/SEM fluxes, it turns out that the complete rise phase and the flux peak suffer from contamination. After removing the affected records, what remains is shown in the figure below. It is clear that most of the event is missed using the GME data. The same applies to several more large events, and in general, even during smaller events, the peak phase is often missed if saturated points are removed.


Two channels of the raw IMP8/GME data for the Oct 89 event, after removing the saturated and spurious fluxes

The situation for the SEM data is better, in the sense that saturation does not occur. However, the SEM data suffer from the appearance of data spikes, as illustrated below for a month of data of GOES05/SEM.


GOES05/SEM data sample illustrating data spikes

It is obvious that these spikes need to be removed before further data processing can take place. During the SEPEM project, a number of algorithms were tested to try to automatically recognize and remove data spikes. One automated method is implemented on the SEPEM server: median filtering. Although this method is able to recognize many of the spikes and to remove them, it has the tendency of lowering the event peak flux (by the nature of the method), and it does not recognize all data spikes.
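The trade-off described above can be illustrated with a minimal sketch. This is not the filter implemented on the SEPEM server: the function name `median_filter`, the window size and the sample fluxes are hypothetical and chosen only to show the behaviour.

```python
from statistics import median

def median_filter(values, half_width=1):
    """Replace each value by the median of a sliding window centred on it.

    The window is truncated at the ends of the series.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed

# Hypothetical flux series: an isolated spike (50.0) followed by a short,
# genuine flux enhancement that peaks at 9.0.
flux = [1.0, 1.1, 50.0, 1.0, 1.2, 3.0, 9.0, 4.0, 1.5]
smoothed = median_filter(flux, half_width=1)

# The spike is removed, but the genuine peak is reduced from 9.0 to 4.0.
print(smoothed)
```

A wider window removes broader spikes but flattens narrow event peaks even further, which is one reason an automated filter of this kind may not be suitable on its own.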

Other methods were tried, but in the end it was decided to remove the spikes by hand. To this end, an application was developed for the SEPEM server in which suspicious data points can be marked and removed. This application was used to manually remove all data spikes in the GOES and IMP data that were used to construct the proton reference dataset. The figure below shows the same GOES05/SEM data sample as above after removing the spikes.


GOES05/SEM data sample of the previous figure after removing the data spikes

After removing the data spikes, the resulting gaps, and any other gaps in the original data, need to be filled in order to arrive at a continuous dataset. The figure below shows a two day sample of GOES07/SEM data where data gaps are clearly visible.


GOES07/SEM data sample showing data gaps

Using the data cleaning tool, these gaps were filled using a linear interpolation. The result is shown in the figure below.


GOES07/SEM data sample after filling the data gaps

All data gaps in the GOES/SEM data were filled in the same way, resulting in a new set of GOES/SEM data where all spikes have been removed, and all gaps filled. The data gaps in the IMP8/GME were not filled, as they are too large. The cleaned datasets are available on the system as tables standard_0001 to standard_0007.

In the SEPEM database, a separate table was created to store all data removal and gap filling actions, i.e. every data point, for the individual channels, that was removed or replaced during gap filling, has been logged.
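As an illustration of gap filling by linear interpolation with a record of the filled points, a minimal sketch is given below. It assumes that missing values are marked as `None` and that interpolation is linear in time on the flux values themselves; the function name `fill_gaps_linear` and the sample data are hypothetical and do not reproduce the SEPEM data cleaning tool.

```python
def fill_gaps_linear(times, values):
    """Fill interior gaps (None) by linear interpolation in time.

    Returns the filled series and the list of indices that were filled,
    so that every replaced point can be logged. Gaps at the start or end
    of the series have no neighbour on one side and are left untouched.
    """
    filled = list(values)
    filled_indices = []
    known = [i for i, v in enumerate(values) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            frac = (times[i] - times[a]) / (times[b] - times[a])
            filled[i] = values[a] + frac * (values[b] - values[a])
            filled_indices.append(i)
    return filled, filled_indices

# Hypothetical 5-minute samples (time in minutes) with a two-point gap.
times = [0, 5, 10, 15, 20]
flux = [1.0, None, None, 4.0, 5.0]
filled, filled_indices = fill_gaps_linear(times, flux)
print(filled)
print(filled_indices)
```

Returning the filled indices alongside the data keeps the interpolated points traceable, in the spirit of the logging table described above.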

Energy re-binning

Now that the respective datasets have been cleaned of data spikes and gaps have been filled (where possible), the data can be re-binned into the 10 energy channels of the proton reference dataset. During the SEPEM project, a number of re-binning schemes were tried and compared.

Firstly, analytical fits to the energy spectra for each data record were computed. Using the analytical fits, new datasets were produced for the energy channels of the proton reference dataset. Three fit functions were tested: power law in energy, exponential in energy, and exponential in rigidity. After comparing the fitted data to the original data, it was concluded that using analytical fits over the total energy range of the proton reference dataset did not produce acceptable results, for the following reasons:


Fluence spectra for the 13–23 Jul 2000 event using cleaned GOES11/SEM data: black squares represent the integrated data, coloured squares represent analytical fits for the energy channels of the proton reference dataset.

For reference, the same plot is shown using the proton reference dataset (which was obtained by cross-calibrating the GOES data against the IMP8/GME data, as described in the next section).


Fluence spectra for the 13–23 Jul 2000 event using the proton reference dataset: black squares represent the integrated data, coloured squares represent analytical fits for the energy channels of the proton reference dataset.

Using the cross-calibrated data, the background signal in the high energy channel has been substantially reduced, resulting in much better spectrum fits. Similar behaviour is shown over the entire dataset; the plots shown here were generated using the event spectra tool on the SEPEM server.
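For readers unfamiliar with the three fit functions named earlier in this section, the sketch below writes them out in their commonly used forms. The exact parameterisation used in SEPEM is not stated here, so the functional forms, the parameter names (`norm`, `gamma`, `e0`, `r0`) and the function names are assumptions. The conversion from kinetic energy to rigidity is the standard relativistic relation for protons.

```python
import math

PROTON_REST_ENERGY_MEV = 938.272  # proton rest energy m_p c^2 in MeV

def rigidity_mv(energy_mev):
    """Proton rigidity in MV for a kinetic energy given in MeV."""
    return math.sqrt(energy_mev ** 2 + 2.0 * energy_mev * PROTON_REST_ENERGY_MEV)

def power_law_in_energy(energy_mev, norm, gamma):
    """Assumed form: J(E) = norm * E**(-gamma)."""
    return norm * energy_mev ** (-gamma)

def exponential_in_energy(energy_mev, norm, e0):
    """Assumed form: J(E) = norm * exp(-E / e0)."""
    return norm * math.exp(-energy_mev / e0)

def exponential_in_rigidity(energy_mev, norm, r0):
    """Assumed form: J(R) = norm * exp(-R / r0), with R the proton rigidity."""
    return norm * math.exp(-rigidity_mv(energy_mev) / r0)

# A 100 MeV proton has a rigidity of roughly 445 MV.
print(rigidity_mv(100.0))
```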

As applying analytical fit functions over the whole spectrum energy range results in unreliable spectra, it was decided to apply power law fits over each separate energy channel: for each energy in the proton reference dataset, the flux values at the boundaries of the original data channel enclosing the reference energy were used to interpolate the original flux to the reference energy. This procedure was repeated for all reference energy channels, for all data records in the GME and SEM datasets used to construct the proton reference dataset. This procedure was performed using the energy re-binning tool on the SEPEM server, for each of the five datasets used. The fitted data were stored in separate tables, for cross-calibration as described below.
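The local power law interpolation can be sketched as follows. This is a minimal illustration, not the energy re-binning tool itself: it assumes that each original channel is represented by a single energy and a positive differential flux, and the function name `power_law_interpolate` and the sample values are hypothetical.

```python
import math
from bisect import bisect_right

def power_law_interpolate(energies, fluxes, e_ref):
    """Interpolate the flux to e_ref with a power law through the two
    bracketing (energy, flux) points, i.e. linearly in log-log space.

    energies must be sorted in ascending order and fluxes must be positive.
    """
    if not energies[0] <= e_ref <= energies[-1]:
        raise ValueError("reference energy outside the range of the data")
    # Index of the lower bracketing point, clamped so that i + 1 is valid.
    i = min(bisect_right(energies, e_ref) - 1, len(energies) - 2)
    e_lo, e_hi = energies[i], energies[i + 1]
    f_lo, f_hi = fluxes[i], fluxes[i + 1]
    # Local spectral index between the two bracketing points.
    index = math.log(f_hi / f_lo) / math.log(e_hi / e_lo)
    return f_lo * (e_ref / e_lo) ** index

# Hypothetical spectrum following E**-2 between 10 and 100 MeV.
energies = [10.0, 100.0]
fluxes = [100.0, 1.0]
flux_ref = power_law_interpolate(energies, fluxes, math.sqrt(1000.0))
print(flux_ref)
```

Because the power law is fitted only between neighbouring points, a poor description of the spectrum in one energy range does not propagate to the other reference channels.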

Cross-calibration

At this stage, five new datasets, re-binned into the proton reference energy channels, are available: the GME dataset, and four GOES/SEM datasets.

Before merging these datasets into a single contiguous set, one more step needs to be taken. The SEM(-2) instruments on the various GOES spacecraft are monitor instruments that are not rigorously calibrated and exhibit significant differences in response, making a simple concatenation of the re-binned datasets impossible. The procedure adopted to establish a common baseline for the SEM(-2) instruments on the four GOES spacecraft used for the proton reference dataset consists of using the GME data as a reference. The GME instrument is a science-quality instrument, which has been properly calibrated.

For each of the four GOES datasets, and for each of the ten reference energy channels, linear regression fits were calculated to scatter plots of the GOES and GME data (after re-binning in energy). The figures below show the scatter plot for the 21.87–31.62 MeV channel, using GOES08 data, on a linear and logarithmic scale, respectively. The regression fit was calculated using the original values, not the logarithms.


Scatter plot of the overlapping IMP8/GME and GOES08/SEM-2 data for proton reference channel 21.87–31.62 MeV. The solid green line represents equality, the red line is the linear regression fit (the regression relation is shown at the top of the plot).


Scatter plot of the overlapping IMP8/GME and GOES08/SEM-2 data shown in the previous figure, now on a log-log scale. The regression line is the same as before (i.e. it was not re-calculated using a logarithmic scale).

For each data channel, the inverse regression fit was then applied to the GOES data, to align them with the GME data. The figures below show the data of the two above figures after applying the inverse fit. After applying the cross-calibration, the data are now scattered around the line of equality.


Scatter plot of the overlapping IMP8/GME and GOES08/SEM-2 data for proton reference channel 21.87–31.62 MeV, after applying the inverse regression fit. The solid green line represents equality, the red line is the original regression fit. The data are plotted on the same scale as the original plots.


Scatter plot of the overlapping IMP8/GME and GOES08/SEM-2 data shown in the previous figure, now on a log-log scale. The regression line is the same as before (i.e. it was not re-calculated using a logarithmic scale).

As a further test, the regression fits were applied to the cross-calibrated data, shown in the figures below. The regression line thus obtained is identical to the line of equality.


Scatter plot of the overlapping IMP8/GME and cross-calibrated GOES08/SEM-2 data for proton reference channel 21.87–31.62 MeV.


Scatter plot of the overlapping IMP8/GME and cross-calibrated GOES08/SEM-2 data shown in the previous figure, now on a log-log scale.

Similar plots were produced for all proton reference energy channels, for the four GOES datasets used for the proton reference dataset. The plots are available as a zip archive. All regression fits were performed with the Cross-calibration tool on the SEPEM server.
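The cross-calibration step for a single channel can be sketched as below. The sketch assumes that the GOES flux is regressed on the GME flux (GOES = intercept + slope × GME) using ordinary least squares on the original values, and that the inverse of this relation is then applied to the GOES data. The function names and the sample fluxes are hypothetical, and the Cross-calibration tool on the SEPEM server may differ in detail.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def apply_inverse_fit(y, slope, intercept):
    """Map the y values back onto the x scale by inverting the fit."""
    return [(yi - intercept) / slope for yi in y]

# Hypothetical overlapping fluxes in one reference channel.
gme = [1.0, 2.0, 4.0, 8.0]
goes = [2.1, 3.4, 6.6, 12.4]

slope, intercept = linear_fit(gme, goes)
goes_calibrated = apply_inverse_fit(goes, slope, intercept)

# Fitting again after calibration gives the line of equality
# (slope 1, intercept 0) up to rounding errors.
check_slope, check_intercept = linear_fit(gme, goes_calibrated)
print(check_slope, check_intercept)
```

The second fit reproduces the further test described above: for a least-squares fit, inverting the regression relation always yields a new regression line that coincides with the line of equality.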

Merging the datasets

The final step in the production of the proton reference dataset consists of merging the re-binned and cross-calibrated datasets, as specified in the data selection table at the top of the page. The full time range of the respective datasets was used for the cross-calibrations in order to ensure maximum overlap with the GME data. The cross-calibrated datasets were then trimmed to the time ranges specified in the third column of the table. The successive data tables were then merged into a new table, standard_0008, which contains the final proton reference dataset.
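The trimming and merging can be sketched as follows. The merging for the reference dataset was done manually, so this is only an illustration: the function names are hypothetical, the records are invented, and only the two selected time spans for GOES05/SEM and GOES07/SEM are taken from the data selection table.

```python
from datetime import date

def trim(records, start, end):
    """Keep the (date, flux) records that fall within [start, end]."""
    return [(day, flux) for day, flux in records if start <= day <= end]

def merge(datasets, selected_ranges):
    """Trim each dataset to its selected time span and concatenate the
    results into a single series sorted by time."""
    merged = []
    for name, (start, end) in selected_ranges.items():
        merged.extend(trim(datasets[name], start, end))
    merged.sort(key=lambda record: record[0])
    return merged

# Selected time spans from the data selection table.
selected_ranges = {
    "GOES05/SEM": (date(1984, 1, 1), date(1987, 3, 5)),
    "GOES07/SEM": (date(1987, 3, 6), date(1995, 2, 28)),
}

# Invented records; GOES05 still has data after its selected span ends.
datasets = {
    "GOES05/SEM": [(date(1987, 3, 5), 0.11), (date(1987, 3, 10), 0.12)],
    "GOES07/SEM": [(date(1987, 3, 6), 0.10), (date(1987, 3, 10), 0.13)],
}

merged = merge(datasets, selected_ranges)
print(merged)
```

Because the selected time spans do not overlap, each time stamp in the merged series comes from exactly one source dataset.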




Last modified on: 12 July 2013.