How to prepare input data for santaR

The santaR package is designed for the detection of significantly altered time trajectories between study groups, in short time-series. It is robust to missing values and noisy measurements without requiring synchronisation in time.

This vignette will:

  • Detail the input format expected by the package
  • Present the provided example dataset ‘acuteInflammation’
  • Save ‘acuteInflammation’ as .csv and .RData files to be used as input for the graphical interface tutorial.

Data format

In short, for a given variable, each measurement (observation) is one element of a vector.

If more than one variable has been measured at a given time, multiple measurement columns can be provided in a data.frame (data) with observations as rows and variables as columns.

For each data point (row), the following metadata vectors are required (they can also be stored together in a data.frame, metadata):

  • time: the time at which the observation was taken.
  • ind: the identifier of the subject (individual) associated with the observation.

Optionally:

  • group: an identifier indicating to which study group the observation belongs.

All observations of a given individual must be assigned to the same group. If exactly 2 groups exist, significantly altered time trajectories can be identified. If no group or more than 2 groups are provided, the trajectories can be plotted but significance cannot be calculated.
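
As a minimal sketch of these rules, the check below verifies that every individual maps to a single group and that exactly 2 groups are present. It uses base R only; the table name exampleMeta and its values are hypothetical and are not part of santaR.

```r
# Hypothetical metadata: one row per observation (illustrative values only)
exampleMeta <- data.frame(
  ind   = c("ind_1", "ind_1", "ind_2", "ind_2", "ind_3"),
  group = c("group_A", "group_A", "group_B", "group_B", "group_A"),
  stringsAsFactors = FALSE
)

# Number of distinct groups recorded for each individual
groupsPerInd <- tapply(exampleMeta$group, exampleMeta$ind,
                       function(g) length(unique(g)))

# TRUE when no individual is split across several groups
oneGroupPerInd <- all(groupsPerInd == 1)

# Significance can only be calculated when exactly 2 groups exist
nGroups <- length(unique(exampleMeta$group))
canTestSignificance <- oneGroupPerInd && nGroups == 2
```

Running such a check before any analysis makes an inconsistent group assignment visible early, rather than at plotting time.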

data and metadata information can be stored as vectors, in a single data.frame or in two separate data.frames. If a data point is not available (no data value for any variable), the row should be discarded. If only some of the variable measurements are missing for a given time-point, the missing values can be replaced by NaN. Do not impute data, as the package is explicitly designed to be robust to missing values.
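
The row-discarding rule can be sketched in base R as follows. This is only an illustration under assumed names (rawMeta, rawData, cleanMeta, cleanData are hypothetical): rows with no measurement at all are removed from both tables, while partially missing rows are kept untouched.

```r
# Hypothetical raw tables with matching rows (illustrative values only)
rawMeta <- data.frame(
  ind   = c("ind_1", "ind_1", "ind_2"),
  time  = c(0, 5, 0),
  group = c("group_A", "group_A", "group_B"),
  stringsAsFactors = FALSE
)
rawData <- data.frame(
  variable1 = c(1,     NA, 4),
  variable2 = c(110.2, NA, NA)
)

# A row is discarded only when every variable is missing
# (is.na() is TRUE for both NA and NaN)
allMissing <- rowSums(!is.na(rawData)) == 0

# Apply the same filter to data and metadata so rows stay matched
cleanMeta <- rawMeta[!allMissing, , drop = FALSE]
cleanData <- rawData[!allMissing, , drop = FALSE]
```

Here the second observation is dropped because both variables are missing, whereas the third is kept with its missing variable2 value left as is.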

Here is an example of 5 observations of 2 variables, taken on 3 individuals separated into 2 groups and covering 3 time-points:

# Metadata
ind     time   group
ind_1   0      group_A
ind_1   5      group_A
ind_2   0      group_B
ind_2   10     group_B
ind_3   5      group_A

# Data
variable1   variable2
1           110.2
3.5         NA
4           79.1
9.5         132
5           528.3
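
For illustration, the two tables above could be built in R as shown below. The object names demoMeta and demoData are hypothetical; only the column names and values come from the example.

```r
# Metadata: one row per observation
demoMeta <- data.frame(
  ind   = c("ind_1", "ind_1", "ind_2", "ind_2", "ind_3"),
  time  = c(0, 5, 0, 10, 5),
  group = c("group_A", "group_A", "group_B", "group_B", "group_A"),
  stringsAsFactors = FALSE
)

# Data: observations as rows, variables as columns, in the same row order
demoData <- data.frame(
  variable1 = c(1, 3.5, 4, 9.5, 5),
  variable2 = c(110.2, NA, 79.1, 132, 528.3)
)

# Both tables must describe the same observations
stopifnot(nrow(demoMeta) == nrow(demoData))
```

The missing variable2 measurement of ind_1 at time 5 is kept as a missing value rather than being imputed.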

Introducing the dataset ‘acuteInflammation’

The santaR package is designed for the analysis of short, noisy time-series such as those produced by most ‘-omics’ platforms, and an example dataset is provided. This dataset, referred to as acuteInflammation, contains the concentrations of 22 mediators of inflammation over an episode of acute inflammation. The mediators were measured at 7 time-points on 8 subjects, and the concentration values have been unit-variance scaled for each variable.

acuteInflammation is stored as two data.frames: meta, holding the metadata of the 56 observations, and data, holding the measurements of the 22 variables:

library(santaR)

## Metadata
# number of rows
nrow(acuteInflammation$meta)
# number of columns
ncol(acuteInflammation$meta)
# a subset
acuteInflammation$meta[12:20,]

[1] 56

[1] 3

  time ind group
12 4 ind_4 Group2
13 4 ind_5 Group1
14 4 ind_6 Group2
15 4 ind_7 Group1
16 4 ind_8 Group2
17 8 ind_1 Group1
18 8 ind_2 Group2
19 8 ind_3 Group1
20 8 ind_4 Group2
## Data
# number of rows
nrow(acuteInflammation$data)
# number of columns
ncol(acuteInflammation$data)
# a subset
acuteInflammation$data[12:20,1:4]

[1] 56

[1] 22

  var_1 var_2 var_3 var_4
12 2.498 1.307 0.08296 1.183
13 -0.3399 -0.6434 0.03206 -0.8927
14 2.668 2.464 1.365 1.743
15 -0.3002 0.05366 0.4509 0.01572
16 3.777 2.543 1.858 2.213
17 -0.3275 0.1564 0.585 0.03299
18 0.708 0.4893 -0.08219 0.9345
19 -0.4101 -0.03727 -0.2914 -0.7239
20 -0.1577 -0.6434 -0.7398 -0.2126

Preparing the csv input for the graphical user interface

While the command line functions accept data.frames and vectors as input, the graphical user interface reads a .csv file.

By concatenating acuteInflammation’s data and metadata tables and saving them in a .csv file, we can prepare the input dataset for the graphical user interface tutorial:

library(santaR)

# Concatenate
outputTable <- cbind(acuteInflammation$meta, acuteInflammation$data)

# Save to disk
outputPath <- file.path('path_to_my_output_folder', 'acuteInflammation_GUI_demo.csv')
write.csv(outputTable, file=outputPath, row.names=FALSE)
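
As a quick sanity check of the resulting layout, the sketch below writes a small table in the same way and reads it back. It is self-contained and does not use santaR: the toy tables toyMeta and toyData, their values, and the temporary file are assumptions made for the example.

```r
# Hypothetical toy tables mimicking the meta/data layout
toyMeta <- data.frame(
  time  = c(0, 4, 0),
  ind   = c("ind_1", "ind_1", "ind_2"),
  group = c("Group1", "Group1", "Group2"),
  stringsAsFactors = FALSE
)
toyData <- data.frame(var_1 = c(0.5, 1.5, -0.25), var_2 = c(0.2, NA, 1.1))

# Write metadata and data side by side, without row names
csvPath <- tempfile(fileext = ".csv")
write.csv(cbind(toyMeta, toyData), file = csvPath, row.names = FALSE)

# Read the file back and split it into metadata and variable columns
reloaded <- read.csv(csvPath, stringsAsFactors = FALSE)
metaCols <- c("time", "ind", "group")
reMeta   <- reloaded[, metaCols, drop = FALSE]
reData   <- reloaded[, setdiff(colnames(reloaded), metaCols), drop = FALSE]
```

With row.names=FALSE the file contains only the metadata columns followed by the variable columns, and missing values are written as NA and read back as missing.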

It is also possible to provide the data directly as 2 data.frames stored in a .RData file, with the data in a data.frame named inData and the metadata in a data.frame named inMeta:

library(santaR)

# Rename datasets
inMeta <- acuteInflammation$meta
inData <- acuteInflammation$data
            
# Save to disk
outputPath <- file.path('path_to_my_output_folder', 'acuteInflammation_GUI_demo.rdata')
save(inMeta, inData, file=outputPath, compress=TRUE)
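
To confirm that a saved file contains objects with the expected names, it can be loaded into a separate environment. The sketch below is self-contained and uses hypothetical toy tables and a temporary file instead of acuteInflammation; only the object names inMeta and inData come from the text above.

```r
# Hypothetical toy tables standing in for the real metadata and data
inMeta <- data.frame(
  time  = c(0, 4),
  ind   = c("ind_1", "ind_1"),
  group = c("Group1", "Group1"),
  stringsAsFactors = FALSE
)
inData <- data.frame(var_1 = c(0.5, 1.5), var_2 = c(0.2, NA))

# Save both objects to a temporary file
rdataPath <- tempfile(fileext = ".rdata")
save(inMeta, inData, file = rdataPath, compress = TRUE)

# Load into a fresh environment so the workspace is left untouched;
# load() returns the names of the restored objects
checkEnv    <- new.env()
loadedNames <- load(rdataPath, envir = checkEnv)

# Both expected objects are present and describe the same observations
stopifnot(all(c("inMeta", "inData") %in% loadedNames))
stopifnot(nrow(checkEnv$inMeta) == nrow(checkEnv$inData))
```

Loading into a dedicated environment avoids overwriting objects of the same name that may already exist in the session.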