Data treatment is the process of altering indicators to improve their statistical properties, mainly for the purposes of aggregation. Data treatment is a delicate subject, because it essentially involves changing the values of certain observations, or transforming an entire distribution. Like any other step or assumption though, any data treatment should be carefully recorded and its implications understood. Data treatment does not have to be applied; it is simply another tool in your toolbox.
Treat() function
The COINr function for treating data is called Treat(). This is a generic function with methods for coins, purses, data frames and numeric vectors. It is very flexible, but this can add a layer of complexity. If you want to run mostly at default options, see the qTreat() function mentioned below in Simplified function.
The Treat() function operates a two-stage data treatment process, based on two data treatment functions (f1 and f2), and a pass/fail function f_pass which detects outliers. The arrangement of this function is inspired by a fairly standard data treatment process applied to indicators: check skew and kurtosis; if the criteria are not met, apply Winsorisation up to a specified limit; then, if Winsorisation still does not bring skew and kurtosis within limits, apply a nonlinear transformation such as log or Box-Cox.
This function generalises this process by using the following general steps:
1. Check whether the variable passes or fails using f_pass
2. If f_pass returns FALSE, apply f1, else return x unmodified
3. Check again using f_pass
4. If f_pass still returns FALSE, apply f2
5. Return the modified x as well as other information.
For the “typical” case described above, f1 is a Winsorisation function, f2 is a nonlinear transformation and f_pass is a skew and kurtosis check. However, any functions can be passed as f1, f2 and f_pass, which makes it a flexible tool that is also compatible with other packages.
Further details on how this works are given in the following sections.
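The general steps above can be sketched in base R. This is only an illustration of the control flow, not COINr's implementation: the names two_stage_treat, simple_skew, skew_ok, cap_max and log_shift are hypothetical helpers invented here, and the pass/fail test is a crude moment-based skew check rather than the package's own.

```r
# Hypothetical sketch of the two-stage logic (not COINr's actual code).
two_stage_treat <- function(x, f1, f2, f_pass) {
  # Step 1: if x already passes, return it unmodified
  if (f_pass(x)) return(list(x = x, treated_by = "none"))

  # Step 2: apply the first treatment and check again
  x1 <- f1(x)
  if (f_pass(x1)) return(list(x = x1, treated_by = "f1"))

  # Step 3: otherwise apply the second treatment (here to the original x)
  list(x = f2(x), treated_by = "f2")
}

# Crude moment-based skewness, used only for this illustration
simple_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Pass/fail check: TRUE if absolute skew is within the limit
skew_ok <- function(x) abs(simple_skew(x)) <= 2

# First treatment: reassign the largest value to the next-highest value
cap_max <- function(x) {
  x[which.max(x)] <- sort(x, decreasing = TRUE)[2]
  x
}

# Second treatment: shifted log transform
log_shift <- function(x) log(x - min(x) + 1)

res <- two_stage_treat(c(1:10, 30, 100), f1 = cap_max, f2 = log_shift, f_pass = skew_ok)
res$treated_by
```

With this vector, capping a single point is enough for the toy check to pass, so the second function is never called.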
The clearest way to demonstrate the Treat() function is on a numeric vector. Let’s make a vector with a couple of outliers:
We can check the skew and kurtosis of this vector:
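As a sketch of these two steps, the vector below takes the integers 1 to 10 and appends two large values. The exact values are an assumption, chosen to fit the description; skew() and kurt() are the helper functions exported by COINr.

```r
library(COINr)

# assumed example vector: ten well-behaved values plus two outliers
x <- c(1:10, 30, 100)

# skew and kurtosis of the vector
skew(x)
kurt(x)
```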
The skew and kurtosis are both high. If we follow the default limits in COINr (absolute skew capped at 2, and kurtosis capped at 3.5), this would be classed as a vector with outliers. Indeed we can confirm this using the check_SkewKurt() function, which is the default pass/fail function used in Treat(). This function also returns the skew and kurtosis:
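A call along the following lines performs that check. The vector is redefined here only so that the snippet runs on its own, and its values are an assumption.

```r
library(COINr)

# assumed example vector with two outliers
x <- c(1:10, 30, 100)

# default pass/fail check: returns a list with Pass, Skew and Kurt
l_check <- check_SkewKurt(x)
l_check$Pass
```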
Now that we know x has outliers, we can treat it (if we want). We use the Treat() function, specifying that our function for checking for outliers is f_pass = "check_SkewKurt", and that our first function for treating outliers is f1 = "winsorise". We also pass an additional parameter to winsorise(), which is winmax = 2. You can check the winsorise() function documentation to better understand how it works.
l_treat <- Treat(x, f1 = "winsorise", f1_para = list(winmax = 2),
                 f_pass = "check_SkewKurt")
plot(x, l_treat$x)
The result of this data treatment is shown in the scatter plot: one point from x has been Winsorised (reassigned the next highest value). We can check the skew and kurtosis of the treated vector:
check_SkewKurt(l_treat$x)
#> $Pass
#> [1] TRUE
#>
#> $Skew
#> [1] 1.712038
#>
#> $Kurt
#> [1] 1.815781
Clearly, Winsorising one point was enough in this case to bring the skew and kurtosis within the specified thresholds.
Treatment of a data frame with Treat() is effectively the same as treating a numeric vector, because the data frame method passes each column of the data frame to the numeric method. Here, we use some data from the COINr package to demonstrate.
# select three indicators
df1 <- ASEM_iData[c("Flights", "Goods", "Services")]
# treat the data frame using defaults
l_treat <- Treat(df1)
str(l_treat, max.level = 1)
#> List of 3
#> $ x_treat :'data.frame': 51 obs. of 3 variables:
#> $ Dets_Table :'data.frame': 3 obs. of 8 variables:
#> $ Treated_Points:'data.frame': 51 obs. of 3 variables:
We can see the output is a list with x_treat, the treated data frame; Dets_Table, a table describing what happened to each indicator; and Treated_Points, which marks which individual points were adjusted. This is effectively the same output as for treating a numeric vector.
l_treat$Dets_Table
#> iCode check_SkewKurt0.Pass check_SkewKurt0.Skew check_SkewKurt0.Kurt
#> 1 Flights FALSE 2.103287 4.508879
#> 2 Goods FALSE 2.649973 8.266610
#> 3 Services TRUE 1.701085 2.375656
#> winsorise.nwin check_SkewKurt1.Pass check_SkewKurt1.Skew check_SkewKurt1.Kurt
#> 1 1 TRUE 1.900658 3.3360647
#> 2 2 TRUE 1.140608 0.1572047
#> 3 NA NA NA NA
We also check the individual points:
l_treat$Treated_Points
#> Flights Goods Services
#> 1
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
#> 8
#> 9
#> 10
#> 11 winhi
#> 12
#> 13
#> 14
#> 15
#> 16
#> 17
#> 18
#> 19
#> 20
#> 21
#> 22
#> 23
#> 24
#> 25
#> 26
#> 27
#> 28
#> 29
#> 30 winhi
#> 31
#> 32
#> 33
#> 34
#> 35 winhi
#> 36
#> 37
#> 38
#> 39
#> 40
#> 41
#> 42
#> 43
#> 44
#> 45
#> 46
#> 47
#> 48
#> 49
#> 50
#> 51
Treating coins is a simple extension of treating a data frame. The coin method simply extracts the relevant data set as a data frame, and passes it to the data frame method. So more or less, the same arguments are present.
We begin by building the example coin, which will be used for the examples here.
coin <- build_example_coin(up_to = "new_coin")
#> iData checked and OK.
#> iMeta checked and OK.
#> Written data set to .$Data$Raw
The Treat() function can be applied directly to a coin with completely default options:
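A minimal sketch of such a call, assuming the raw data set is selected with dset = "Raw" as in the later examples in this document:

```r
library(COINr)

# build the example coin, then treat the raw data with default specifications
coin <- build_example_coin(up_to = "new_coin")
coin <- Treat(coin, dset = "Raw")
```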
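For each indicator, the Treat() function:
1. Checks skew and kurtosis using the check_SkewKurt() function
2. If the indicator fails the test (returns FALSE), applies Winsorisation
3. Checks skew and kurtosis again
4. If the indicator still fails, applies a log transformation
If at any stage the indicator passes the skew and kurtosis test, it is returned without further treatment.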
When we run Treat() on a coin, it also stores information returned from f1, f2 and f_pass in the coin:
# summary of treatment for each indicator
head(coin$Analysis$Treated$Dets_Table)
#> iCode check_SkewKurt0.Pass check_SkewKurt0.Skew check_SkewKurt0.Kurt
#> 1 LPI TRUE -0.3042681 -0.6567514
#> 2 Flights FALSE 2.1032872 4.5088794
#> 3 Ship TRUE -0.5756680 -0.6814795
#> 4 Bord FALSE 2.1482360 5.7914905
#> 5 Elec FALSE 2.2252736 5.7910268
#> 6 Gas FALSE 2.8294486 10.3346494
#> winsorise.nwin check_SkewKurt1.Pass check_SkewKurt1.Skew check_SkewKurt1.Kurt
#> 1 NA NA NA NA
#> 2 1 TRUE 1.900658 3.336065
#> 3 NA NA NA NA
#> 4 1 TRUE 1.899211 4.346298
#> 5 1 TRUE 1.717744 2.586062
#> 6 1 TRUE 1.602518 1.525576
Notice that only one treatment function was used here, since after Winsorisation (f1), all indicators passed the skew and kurtosis test (f_pass).
In general, Treat() tries to collect all information returned from the functions that it calls. Details of the treatment of individual points are also stored in .$Analysis$Treated$Treated_Points.
The Treat() function gives you a high degree of control over which functions are used to treat and test indicators, and it is also possible to specify different functions for different indicators. Let’s begin though by seeing how we can change the specifications for all indicators, before proceeding to individual treatment.
Unless indiv_specs is specified (see later), the same procedure is applied to all indicators. This process is specified by the global_specs argument. To see how to use this, it is easiest to show the default of this argument, which is built into the Treat() function:
# default treatment for all cols
specs_def <- list(f1 = "winsorise",
                  f1_para = list(na.rm = TRUE,
                                 winmax = 5,
                                 skew_thresh = 2,
                                 kurt_thresh = 3.5,
                                 force_win = FALSE),
                  f2 = "log_CT",
                  f2_para = list(na.rm = TRUE),
                  f_pass = "check_SkewKurt",
                  f_pass_para = list(na.rm = TRUE,
                                     skew_thresh = 2,
                                     kurt_thresh = 3.5))
Notice that there are six entries in the list:
- f1, which is a string referring to the first treatment function
- f1_para, which is a list of any other named arguments to f1, excluding x (the data to be treated)
- f2 and f2_para, which are analogous to f1 and f1_para but for the second treatment function
- f_pass, which is a string referring to the function to check for outliers
- f_pass_para, a list of any other named arguments to f_pass, other than x (the data to be checked)
To understand what the individual parameters do, for example in f1_para, we need to look at the function called by f1, which is the winsorise() function:
- x: A numeric vector.
- na.rm: Set TRUE to remove NA values, otherwise returns NA.
- winmax: Maximum number of points to Winsorise. Default 5. Set NULL to have no limit.
- skew_thresh: A threshold for absolute skewness (positive). Default 2.25.
- kurt_thresh: A threshold for kurtosis. Default 3.5.
- force_win: Logical: if TRUE, forces winsorisation up to winmax (regardless of skew/kurt).
Here we see the same parameters as named in the list f1_para, and we can change the maximum number of points to be Winsorised, the skew and kurtosis thresholds, and other things.
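To see these arguments in action outside Treat(), winsorise() can be called directly on a vector. This is a sketch: the vector is an illustrative assumption, and the output element names x and nwin are inferred from the winsorise.nwin column in the treatment tables, so check the function documentation before relying on them.

```r
library(COINr)

# assumed example vector: ten well-behaved values plus two outliers
x <- c(1:10, 30, 100)

# Winsorise at most two points, using the default skew and kurtosis thresholds
l_win <- winsorise(x, winmax = 2)

# treated vector and number of points that were Winsorised
l_win$x
l_win$nwin
```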
To make adjustments, unless we want to redefine everything, we don’t need to specify the entire list. So for example, if we want to change the maximum Winsorisation limit winmax, we can just pass this part of the list (notice we still have to wrap the parameter inside a list):
# treat with max winsorisation of 1 point
coin <- Treat(coin, dset = "Raw", global_specs = list(f1_para = list(winmax = 1)))
#> Written data set to .$Data$Treated
#> (overwritten existing data set)
# see what happened
coin$Analysis$Treated$Dets_Table |>
head(10)
#> iCode check_SkewKurt0.Pass check_SkewKurt0.Skew check_SkewKurt0.Kurt
#> 1 LPI TRUE -0.3042681 -0.6567514
#> 2 Flights FALSE 2.1032872 4.5088794
#> 3 Ship TRUE -0.5756680 -0.6814795
#> 4 Bord FALSE 2.1482360 5.7914905
#> 5 Elec FALSE 2.2252736 5.7910268
#> 6 Gas FALSE 2.8294486 10.3346494
#> 7 ConSpeed TRUE 0.4622037 0.1873214
#> 8 Cov4G TRUE -1.3725191 0.5419314
#> 9 Goods FALSE 2.6499733 8.2666095
#> 10 Services TRUE 1.7010849 2.3756557
#> winsorise.nwin check_SkewKurt1.Pass check_SkewKurt1.Skew
#> 1 NA NA NA
#> 2 1 TRUE 1.900658
#> 3 NA NA NA
#> 4 1 TRUE 1.899211
#> 5 1 TRUE 1.717744
#> 6 1 TRUE 1.602518
#> 7 NA NA NA
#> 8 NA NA NA
#> 9 1 FALSE 2.469910
#> 10 NA NA NA
#> check_SkewKurt1.Kurt check_SkewKurt2.Pass check_SkewKurt2.Skew
#> 1 NA NA NA
#> 2 3.336065 NA NA
#> 3 NA NA NA
#> 4 4.346298 NA NA
#> 5 2.586062 NA NA
#> 6 1.525576 NA NA
#> 7 NA NA NA
#> 8 NA NA NA
#> 9 7.087309 TRUE 0.03104001
#> 10 NA NA NA
#> check_SkewKurt2.Kurt
#> 1 NA
#> 2 NA
#> 3 NA
#> 4 NA
#> 5 NA
#> 6 NA
#> 7 NA
#> 8 NA
#> 9 -0.8888965
#> 10 NA
Having imposed a much stricter Winsorisation limit (only one point), we can see that now one indicator has been passed to the second treatment function f2, which has performed a log transformation. After doing this, the indicator passes the skew and kurtosis test.
By default, if an indicator does not satisfy f_pass after applying f1, it is passed to f2 in its original form. In other words, it is not the output of f1 that is passed to f2, and f2 is applied instead of f1, rather than in addition to it. If you want to apply f2 on top of f1, set combine_treat = TRUE. In this case, if f_pass is not satisfied after f1 then the output of f1 is used as the input of f2. For the defaults of f1 and f2 this approach is probably not advisable because Winsorisation and the log transform are quite different approaches. However, depending on what you want to do, it might be useful.
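As a sketch of the combined option, the call below uses a one-point Winsorisation limit and sets combine_treat = TRUE, so that any indicator still failing after f1 has f2 applied to the Winsorised values rather than to the original ones. The choice of winmax = 1 and of inspecting “Goods” is only for illustration.

```r
library(COINr)

coin <- build_example_coin(up_to = "new_coin")

# allow only one Winsorised point, and apply f2 to the output of f1
coin <- Treat(coin, dset = "Raw",
              global_specs = list(f1_para = list(winmax = 1)),
              combine_treat = TRUE)

# inspect the treatment record for one indicator
dets <- coin$Analysis$Treated$Dets_Table
dets[dets$iCode == "Goods", ]
```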
The global_specs argument specifies the treatment methodology to apply to all indicators. However, the indiv_specs argument (if specified) can be used to override the treatment specified in global_specs for specific indicators. It is specified in exactly the same way as global_specs but requires a parameter list for each indicator that is to have individual specifications applied, wrapped inside one list.
This is probably clearer using an example. To begin with something simple, let’s say that we keep the defaults for all indicators except one, where we change the Winsorisation limit. We will set the Winsorisation limit of the indicator “Flights” to zero, to force it to be log-transformed.
# change individual specs for Flights
indiv_specs <- list(
  Flights = list(
    f1_para = list(winmax = 0)
  )
)
# re-run data treatment
coin <- Treat(coin, dset = "Raw", indiv_specs = indiv_specs)
#> Written data set to .$Data$Treated
#> (overwritten existing data set)
The only thing to remember here is to make sure the list is created correctly. Each indicator that is assigned individual treatment must have its own list, here containing f1_para. Then f1_para itself is a list of named parameter values for f1. Finally, the lists for all indicators have to be wrapped into a single list to pass to indiv_specs. This looks a bit convoluted for changing a single parameter, but gives a high degree of control over how data treatment is performed.
We can now see what happened to “Flights”:
coin$Analysis$Treated$Dets_Table[
coin$Analysis$Treated$Dets_Table$iCode == "Flights",
]
#> iCode check_SkewKurt0.Pass check_SkewKurt0.Skew check_SkewKurt0.Kurt
#> 2 Flights FALSE 2.103287 4.508879
#> winsorise.nwin check_SkewKurt1.Pass check_SkewKurt1.Skew check_SkewKurt1.Kurt
#> 2 0 FALSE 2.103287 4.508879
#> check_SkewKurt2.Pass check_SkewKurt2.Skew check_SkewKurt2.Kurt
#> 2 TRUE -0.09502644 -0.8305217
Now we see that “Flights” didn’t pass the first Winsorisation step (because nothing happened to it), and was passed to the log transform. After that, the indicator passed the skew and kurtosis check.
As another example, we may wish to exclude some indicators from data treatment completely. To do this, we can set the corresponding entries in indiv_specs to "none". This is the only case where we don’t have to pass a list for each indicator.
# change individual specs for two indicators
indiv_specs <- list(
Flights = "none",
LPI = "none"
)
# re-run data treatment
coin <- Treat(coin, dset = "Raw", indiv_specs = indiv_specs)
#> Written data set to .$Data$Treated
#> (overwritten existing data set)
Now if we examine the treatment table, we will find that these indicators have been excluded from the table, as they were not subjected to treatment.
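One way to check this is sketched below. It rebuilds the example coin so that the snippet runs on its own, then tests whether the excluded indicator codes appear in the treatment table and whether the values of one excluded indicator are unchanged between the raw and treated data sets. The comparison of coin$Data$Raw with coin$Data$Treated assumes both data frames keep the same row order.

```r
library(COINr)

coin <- build_example_coin(up_to = "new_coin")

# exclude two indicators from treatment
coin <- Treat(coin, dset = "Raw",
              indiv_specs = list(Flights = "none", LPI = "none"))

# are the excluded indicators listed in the treatment table?
c("Flights", "LPI") %in% coin$Analysis$Treated$Dets_Table$iCode

# are the values of an excluded indicator unchanged?
all.equal(coin$Data$Treated$Flights, coin$Data$Raw$Flights)
```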
Any functions can be passed to Treat(), for both treating and checking for outliers. As an example, we can pass an outlier detection function, check_outliers(), from the performance package.
The following code chunk will only run if you have the ‘performance’ package installed.
library(performance)

# check_outliers() outputs a logical vector which flags specific points as outliers.
# We need to wrap this to give a single TRUE/FALSE output, where FALSE means it doesn't pass,
# i.e. there are outliers
outlier_pass <- function(x){
  # return FALSE if any outliers
  !any(check_outliers(x))
}

# now call Treat(), passing this function
# we set f_pass_para to NULL to avoid passing default parameters to the new function
coin <- Treat(coin, dset = "Raw",
              global_specs = list(f_pass = "outlier_pass",
                                  f_pass_para = NULL)
)
#> Warning in proc_passing(passing, f_pass, 0): f_pass has returned NA. Returning
#> untreated vector.
#> Warning in proc_passing(passing, f_pass, 2): f_pass has returned NA. Returning
#> untreated vector.
#> Warning in proc_passing(passing, f_pass, 0): f_pass has returned NA. Returning
#> untreated vector.
#> Warning in proc_passing(passing, f_pass, 2): f_pass has returned NA. Returning
#> untreated vector.
#> Warning in proc_passing(passing, f_pass, 2): f_pass has returned NA. Returning
#> untreated vector.
#> Warning in proc_passing(passing, f_pass, 2): f_pass has returned NA. Returning
#> untreated vector.
#> Warning in proc_passing(passing, f_pass, 0): f_pass has returned NA. Returning
#> untreated vector.
#> Warning in proc_passing(passing, f_pass, 2): f_pass has returned NA. Returning
#> untreated vector.
#> Warning in proc_passing(passing, f_pass, 0): f_pass has returned NA. Returning
#> untreated vector.
#> Warning in proc_passing(passing, f_pass, 0): f_pass has returned NA. Returning
#> untreated vector.
#> Written data set to .$Data$Treated
#> (overwritten existing data set)
# see what happened
coin$Analysis$Treated$Dets_Table |>
head(10)
#> iCode outlier_pass0 winsorise.nwin outlier_pass1 outlier_pass2
#> 1 LPI TRUE NA NA NA
#> 2 Flights FALSE 1 FALSE TRUE
#> 3 Ship TRUE NA NA NA
#> 4 Bord FALSE 1 FALSE TRUE
#> 5 Elec FALSE 1 FALSE TRUE
#> 6 Gas FALSE 1 FALSE FALSE
#> 7 Cov4G FALSE 0 FALSE FALSE
#> 8 Goods FALSE 2 FALSE TRUE
#> 9 Services FALSE 0 FALSE TRUE
#> 10 FDI FALSE 1 FALSE TRUE
Here we see that the test for outliers is much stricter: in the rows shown, only two indicators pass the initial check, and two still fail after the log transformation. Clearly, how an outlier is defined can vary and depend on your application.
The purse method for Treat() is fairly straightforward. It takes almost the same arguments as the coin method, and applies the same specifications to each coin. Here we simply demonstrate it on the example purse.
# build example purse
purse <- build_example_purse(up_to = "new_coin", quietly = TRUE)
# apply treatment to all coins in purse (default specs)
purse <- Treat(purse, dset = "Raw")
#> Written data set to .$Data$Treated
#> Written data set to .$Data$Treated
#> Written data set to .$Data$Treated
#> Written data set to .$Data$Treated
#> Written data set to .$Data$Treated
The Treat() function is very flexible but comes at the expense of a possibly fiddly syntax. If you don’t need that level of flexibility, consider using qTreat(), which is a simplified wrapper for Treat().
The main features of qTreat() are that:
- The first treatment function f1 cannot be changed and is set to winsorise().
- The winmax parameter, as well as the skew and kurtosis limits, are available directly as function arguments to qTreat().
- The f_pass function cannot be changed and is always set to check_SkewKurt().
- The second treatment function can still be set via f2.
The qTreat() function is a generic with methods for data frames, coins and purses. Here, we’ll just demonstrate it on a data frame.
# select three indicators
df1 <- ASEM_iData[c("Flights", "Goods", "Services")]
# treat data frame, changing winmax and skew/kurtosis limits
l_treat <- qTreat(df1, winmax = 1, skew_thresh = 1.5, kurt_thresh = 3)
Now we check what the results are:
l_treat$Dets_Table
#> iCode check_SkewKurt0.Pass check_SkewKurt0.Skew check_SkewKurt0.Kurt
#> 1 Flights FALSE 2.103287 4.508879
#> 2 Goods FALSE 2.649973 8.266610
#> 3 Services TRUE 1.701085 2.375656
#> winsorise.nwin check_SkewKurt1.Pass check_SkewKurt1.Skew check_SkewKurt1.Kurt
#> 1 1 FALSE 1.900658 3.336065
#> 2 1 FALSE 2.469910 7.087309
#> 3 NA NA NA NA
#> check_SkewKurt2.Pass check_SkewKurt2.Skew check_SkewKurt2.Kurt
#> 1 TRUE -0.09502644 -0.8305217
#> 2 TRUE 0.03104001 -0.8888965
#> 3 NA NA NA
We can see that in this case, Winsorising by one point was not enough to bring “Flights” and “Goods” within the specified skew/kurtosis limits. Consequently, f2 was invoked, which uses a log transform and brought both indicators within the specified limits.
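The coin method can be sketched in the same way. This assumes that qTreat() takes a dset argument for coins, as Treat() does; the winmax value is arbitrary.

```r
library(COINr)

coin <- build_example_coin(up_to = "new_coin")

# simplified treatment of the raw data set, allowing one Winsorised point
coin <- qTreat(coin, dset = "Raw", winmax = 1)

# summary of treatment for each indicator
head(coin$Analysis$Treated$Dets_Table)
```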