---
title: "Automated command line analysis"
author: "Arnaud Wolfer"
date: "2019-10-03"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Automated command line analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The `santaR` package is designed for the detection of significantly altered time trajectories between study groups, in short time-series. Command line parallelisation and reporting functions allow the automated analysis of multiple variables.

The automated command line functions are to be preferred to the GUI when processing a very high number of variables, as they are more efficient and can be integrated into scripts.

Using an example dataset, this vignette will:

* Detail the parallel processing function
* Detail the automated reporting function
* Save the processing results in a `.RData` file to be opened with the graphical interface for further analysis


## Parallel processing

Within a single experiment, multiple variables can be measured and explored dynamically (_e.g. NMR or MS features, genes_). As `santaR`'s analysis is a univariate approach, each variable can be fitted independently. This lack of dependency makes `santaR`'s analysis an embarrassingly parallel workload.

The `santaR_auto_fit()` function is a wrapper for each of the analytical functions (i.e. `get_ind_time_matrix()`, `santaR_fit()`, `santaR_CBand()`, `santaR_pvalue_dist()` and `santaR_pvalue_fit()`), executing them in a parallel fashion (_for each individual function see the help and [advanced command line options vignette](advanced-command-line-functions.html)_). 
The parallelisation relies on the `doParallel` package for the instantiation of worker nodes and on `foreach` for the distribution of tasks. This set of packages enables parallelisation on all operating systems (_Windows, Mac OS and most Linux distributions_).
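The column-wise independence described above can be illustrated with a minimal sketch. It is not `santaR`'s internal code: the toy data, the helper `fit_rss()` and the use of `stats::smooth.spline()` as a stand-in for the per-variable analysis are all assumptions made for illustration. Only the general `doParallel` / `foreach` pattern is shown.

```r
library(foreach)
library(doParallel)

# Toy data (hypothetical): 10 time points as rows, 4 variables as columns
set.seed(1)
time_pts <- seq_len(10)
toy_data <- as.data.frame(matrix(rnorm(40), nrow = 10,
                                 dimnames = list(NULL, paste0("var_", 1:4))))

# Stand-in for a per-variable analysis: residual sum of squares of a spline fit
fit_rss <- function(y, x) {
  fit <- stats::smooth.spline(x = x, y = y, df = 5)
  sum((stats::predict(fit, x)$y - y)^2)
}

# Instantiate 2 worker nodes and register them as the foreach backend
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)

# Each column is an independent task distributed to the workers
par_res <- foreach(j = seq_len(ncol(toy_data)), .combine = c) %dopar% {
  fit_rss(toy_data[[j]], time_pts)
}
names(par_res) <- colnames(toy_data)

parallel::stopCluster(cl)

# Sequential reference: same computation, one column after the other
seq_res <- vapply(toy_data, fit_rss, numeric(1), x = time_pts)
```

Because no column depends on another, the parallel and sequential runs should return the same values; only the wall-clock time differs.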

Observation values are expected as a data-frame with samples as _rows_ and variables as _columns_, the parallelisation taking place over the _columns_. For a selected number of CPU cores (`ncores` parameter), `santaR_auto_fit()` first instantiates the worker nodes (_if `ncores=0`, the procedure is applied sequentially, without parallelisation_). The conversion of inputs by `get_ind_time_matrix()` is, however, not parallelised by default, as the parallelisation overhead exceeds the time gained for all but the most complex datasets. When the number of individuals, unique time points, or variables is large, the `forceParIndTimeMat` parameter enables the parallelisation of this step. All subsequent analytical steps are automatically parallelised, with the calculation of confidence bands on the group mean curves and the identification of altered trajectories activated by default.
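As a rough sketch of the expected input layout, the toy objects below (all names and values are hypothetical) hold one sample per row and one variable per column, with metadata vectors of matching length. The last lines show one possible way of choosing a value for `ncores`; leaving one core free is a common convention, not a `santaR` requirement.

```r
# Toy design (hypothetical): 3 individuals x 4 time points, 2 groups
toy_meta <- data.frame(
  ind   = rep(c("ind_1", "ind_2", "ind_3"), each = 4),
  time  = rep(c(0, 4, 8, 12), times = 3),
  group = rep(c("A", "A", "B"), each = 4)
)

# Observations: samples as rows, variables as columns
set.seed(1)
toy_data <- data.frame(var_1 = rnorm(12), var_2 = rnorm(12))

# Metadata vectors must describe every row of the data exactly once
stopifnot(
  nrow(toy_data) == length(toy_meta$ind),
  nrow(toy_data) == length(toy_meta$time),
  nrow(toy_data) == length(toy_meta$group)
)

# Leave one core free; fall back to sequential processing (0) if detection fails
n_detected <- parallel::detectCores()
ncores     <- if (is.na(n_detected)) 0 else max(n_detected - 1, 0)
```

Objects of this shape could then be passed as `inputData`, `ind`, `time`, `group` and `ncores`.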

`santaR_auto_fit()` returns a list of _SANTAObj_ containing each variable's analysis results. In practice, `santaR_auto_fit()` is the function employed for command line
analysis as it caters for all possible use cases.

```{r, eval = FALSE}
library(santaR)

# Load example data
tmp_data  <- acuteInflammation$data
tmp_meta  <- acuteInflammation$meta

# Analyse data, with confidence bands and p-value
res_acuteInf_df5 <- santaR_auto_fit(inputData=tmp_data, ind=tmp_meta$ind,
                                    time=tmp_meta$time, group=tmp_meta$group,
                                    df=5, ncores=4, CBand=TRUE, pval.dist=TRUE)
# Input data generated: 0.13 secs
# Spline fitted:        1.05 secs
# ConfBands done:      18.98 secs
# p-val dist done:     35.43 secs
# total time:          55.59 secs

length(res_acuteInf_df5)
# [1] 22
names(res_acuteInf_df5)
#  [1] "var_1"  "var_2"  "var_3"  "var_4"  "var_5"  "var_6"  "var_7"  "var_8"  "var_9"  "var_10" "var_11" "var_12" "var_13" "var_14" "var_15" "var_16" "var_17" "var_18"
# [19] "var_19" "var_20" "var_21" "var_22"
```


## Automated Reporting

After multiple variables have been analysed using `santaR_auto_fit()`, a reporting function helps assess significant results and summarise them in an easily interpretable fashion. `santaR_auto_summary()` takes a list of _SANTAObj_ as generated by `santaR_auto_fit()` as input.

First, correction for multiple testing can be applied to generate Bonferroni, Benjamini-Hochberg or Benjamini-Yekutieli corrected _p_-values. _P_-values can
be returned by the function, but also automatically saved to disk as `.csv`.
For a given significance cut-off (`plotCutOff` parameter), the number of significantly altered variables is reported and plots are automatically saved to disk in order of increasing _p_-value. The aspect of the plots can be altered using multiple options, such as the representation of confidence bands (`showConfBand` parameter) or the generation of a mean curve across all samples (`showTotalMeanCurve` parameter), which can help assess differences between groups when group sizes are unbalanced.

```{r, eval = FALSE}
# Generate a summary
#   without a defined 'targetFolder', no csv or plots can be saved
pval_acuteInf_df5 <- santaR_auto_summary(SANTAObjList=res_acuteInf_df5, targetFolder=NA)
# p-value dist found
# Benjamini-Hochberg corrected p-value

names(pval_acuteInf_df5)
# [1] "pval.all"     "pval.summary"

pval_acuteInf_df5$pval.summary
```
```{r, results = "asis", echo = FALSE}
pval.summary            <- data.frame(matrix(c('dist', 'dist_BH', 17, 16, 8, 0, 0, 0), ncol=4))
colnames(pval.summary)  <- c('Test', 'Inf 0.05', 'Inf 0.01', 'Inf 0.001')
pander::pandoc.table(pval.summary)
```
```{r, eval = FALSE}
pval_acuteInf_df5$pval.all
```
```{r, results = "asis", echo = FALSE}
pval.all <- data.frame(matrix(c(0.009990010, 0.007992008, 0.006993007, 0.209790210, 0.005994006, 0.008991009, 0.013986014, 0.009990010, 0.038961039, 0.034965035, 0.013986014, 0.214785215, 0.066933067, 0.154845155, 0.008991009, 0.015984016, 0.019980020, 0.029970030, 0.053946054, 0.023976024, 0.022977023, 0.007992008, 0.01829662, 0.01569580, 0.01436896, 0.23611241, 0.01302000, 0.01700412, 0.02334465, 0.01829662, 0.05282484, 0.04824640, 0.02334465, 0.24130467, 0.08413827, 0.17858350, 0.01700412, 0.02581244, 0.03066597, 0.04246854, 0.06973190, 0.03543451, 0.03424914, 0.01569580, 0.005433704, 0.004053809, 0.003390296, 0.185689133, 0.002748896, 0.004735847, 0.008347097, 0.005433704, 0.028625807, 0.025242819, 0.008347097, 0.190448652, 0.053042348, 0.133748457, 0.004735847, 0.009860016, 0.012967910, 0.021068901, 0.041574088, 0.016160798, 0.015355810, 0.004053809, -0.2429725352, 0.0006572238, -0.1309866546, -0.3878298395, -0.5634863016, -0.4766589789, -0.5628753031, -0.4678733066, -0.3890447845, -0.0501685235, 0.0568042664, 0.1530029385, -0.4077714803, -0.0650366487, 0.1268468873,  0.5054671665, 0.2797620452,  0.4027539783, 0.5014823976, 0.3899306066, 0.1458163093, -0.2074773622, 0.02747253, 0.02747253, 0.02747253, 0.21478521, 0.02747253, 0.02747253, 0.03076923, 0.02747253, 0.05042017, 0.04807692, 0.03076923, 0.21478521, 0.07750145, 0.17032967, 0.02747253, 0.03196803, 0.03663004, 0.04395604, 0.06593407, 0.03767661, 0.03767661, 0.02747253), ncol=5))
colnames(pval.all) <- c("dist", "dist_upper", "dist_lower", "curveCorr", "dist_BH")
rownames(pval.all) <- c("var_1", "var_2", "var_3", "var_4", "var_5", "var_6", "var_7", "var_8", "var_9", "var_10", "var_11", "var_12", "var_13", "var_14", "var_15", "var_16", "var_17", "var_18", "var_19", "var_20", "var_21", "var_22")
pander::pandoc.table(pval.all)
```
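The corrected values and counts reported above can be cross-checked with base R. The sketch below copies the `dist` column of `pval.all` and applies `p.adjust()`. The `dist_BH` column appears consistent with the `"BH"` method, but whether `santaR_auto_summary()` calls `p.adjust()` internally is an assumption. The variable `cut_off` and the choice of ordering on the uncorrected _p_-value are likewise illustrative.

```r
# 'dist' p-values copied from the pval.all table
p_dist <- c(0.009990010, 0.007992008, 0.006993007, 0.209790210, 0.005994006,
            0.008991009, 0.013986014, 0.009990010, 0.038961039, 0.034965035,
            0.013986014, 0.214785215, 0.066933067, 0.154845155, 0.008991009,
            0.015984016, 0.019980020, 0.029970030, 0.053946054, 0.023976024,
            0.022977023, 0.007992008)
names(p_dist) <- paste0("var_", 1:22)

# Multiple-testing corrections available in base R
p_bh   <- p.adjust(p_dist, method = "BH")          # Benjamini-Hochberg
p_by   <- p.adjust(p_dist, method = "BY")          # Benjamini-Yekutieli
p_bonf <- p.adjust(p_dist, method = "bonferroni")  # Bonferroni

# Number of variables below each cut-off, laid out like pval.summary
cutoffs <- c(0.05, 0.01, 0.001)
counts  <- sapply(cutoffs, function(k) {
  c(dist = sum(p_dist < k), dist_BH = sum(p_bh < k))
})
colnames(counts) <- paste("Inf", cutoffs)

# Variables below a cut-off, by increasing (uncorrected) p-value
cut_off     <- 0.05
significant <- sort(p_dist[p_dist < cut_off])
```

With these inputs, `counts` should match the `pval.summary` table shown earlier in this section.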


## Save results for GUI

In practice, time-dependent patterns for a given biological question (_e.g. a grouping of individuals_) are assessed by parallelised fitting and analysis using `santaR_auto_fit()` and reporting using `santaR_auto_summary()`. When results are available, the most significantly altered variables can be identified using the reports and visually inspected for confirmation using the plots already saved to disk.

Additionally, analysis results can be loaded into the GUI for interactive visualisation or generation of plots. To do so, the list of _SANTAObj_ generated by `santaR_auto_fit()` must be saved under the variable name `inSp` in a `.RData` file:

```{r, eval = FALSE}
# Rename the results
inSp        <- res_acuteInf_df5
# Save to disk
outputPath  <- file.path('path_to_my_output_folder', 'acuteInf_results.rdata') 
save(inSp, file=outputPath, compress=TRUE)
```
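Before opening the file in the GUI, it can be worth confirming that it really contains an object named `inSp`. The sketch below uses a placeholder list in place of real `santaR_auto_fit()` output and a temporary file; both are assumptions made so the example is self-contained.

```r
# Placeholder for the list returned by santaR_auto_fit() (hypothetical content)
inSp <- list(var_1 = list(), var_2 = list())

# Save under the variable name 'inSp', here to a temporary location
outputPath <- file.path(tempdir(), "toy_results.rdata")
save(inSp, file = outputPath, compress = TRUE)

# Reload into a fresh environment; load() returns the names of restored objects
check_env <- new.env()
loaded    <- load(outputPath, envir = check_env)

# The file should hold a list stored under the name 'inSp'
stopifnot("inSp" %in% loaded, is.list(check_env$inSp))
```

Loading into a separate environment avoids overwriting an `inSp` object that may already exist in the current session.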


## See Also

* [Getting Started with santaR](getting-started.html)
* [How to prepare input data for santaR](prepare-input-data.html)
* [santaR theoretical background](theoretical-background.html)
* [Graphical user interface use](santaR-GUI.pdf)
* [Plotting options](plotting-options.html)
* [Selecting an optimal number of degrees of freedom](selecting-optimal-df.html)
* [Advanced command line options](advanced-command-line-functions.html)