Use of Rmarkdown: taking GFM package as an example

Introduction to Rmarkdown

In this section, we briefly introduce some features of Rmarkdown. The language of Rmarkdown is similar to that of Markdown, and it is easy to learn and write.

Creating an Rmarkdown file in RStudio

First, we create a new file in RStudio and save it with the extension ‘Rmd’. If we carelessly save it with the extension ‘md’, we cannot run the R code in the chunks. Second, we set the header, which includes ‘title’, ‘author’, ‘date’ and ‘output’: ‘title’ is the title of the document, ‘author’ and ‘date’ are its author and creation date, and ‘output’ specifies the settings of the output file. We can generate three types of output file, ‘html’, ‘word’ and ‘pdf’, which can be selected in the settings window of ‘Knit’. The following is the header of this file.

---
title: "Use of Rmarkdown: taking GFM package as an example"
author: "Wei Liu"
date: '2020-11-23'
output:
  pdf_document:
    highlight: kate
    number_sections: yes
    toc: yes
  word_document:
    toc: yes
  html_document:
    fig_caption: yes
    highlight: pygments
    theme: cerulean
    toc: yes
---
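The ‘Knit’ button is not the only way to produce the output files. As a minimal sketch, the same header can be rendered from the R console with `rmarkdown::render`. The file name `gfm_example.Rmd` is hypothetical, and the call is guarded so that nothing happens if the rmarkdown package or the file is missing.

```r
# Hypothetical file name; replace it with the path to your own document.
rmd_file <- "gfm_example.Rmd"

if (requireNamespace("rmarkdown", quietly = TRUE) && file.exists(rmd_file)) {
  # Render a single format declared in the YAML header ...
  rmarkdown::render(rmd_file, output_format = "html_document")
  # ... or every format listed under `output:` in one call.
  rmarkdown::render(rmd_file, output_format = "all")
}
```

Rendering from the console is convenient when the document has to be rebuilt by a script rather than by hand.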

After finishing the settings, we can write the contents of the document freely. If we have any question about the syntax, we can turn to Baidu, Bing or Google!

R package GFM

In this section, we provide an introduction to the GFM package, which is available on GitHub. The R package GFM implements GFM, the generalized factor model for ultra-high-dimensional mixed correlated data. It is more powerful than linear factor analysis, since it can handle mixed data, achieves nonlinear feature extraction and comes with theoretical guarantees. We can install the package from GitHub using the following code.

library(devtools)
install_github("feiyoung/GFM")

Load the package using the following command:

library(GFM)

GFM feature extraction using simulated data

In the following, we give some examples with different variable types.

Homogeneous continuous variables

We first generate data with homogeneous normal variables from the following model \[x_{ij}= \mu_j + h_i b_j^T + u_{ij}, \] where \(u_{ij} \sim N(0, \sigma^2)\). Such data can be generated by the function gendata:

n <- 100
p <- 100
q <- 2; rho <- 3
dat <- gendata(q = q, n=n, p=p, rho=rho)
str(dat)

## List of 4
##  $ X  : num [1:100, 1:100] -0.5488 1.4796 0.0512 0.2888 -0.831 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##  $ B0 : num [1:100, 1:2] 0.1991 -0.0584 0.2656 -0.5071 -0.1047 ...
##  $ H0 : num [1:100, 1:2] -0.148658 -0.000162 -0.098823 1.02493 0.628709 ...
##  $ mu0: num [1:100] 0.164 0.676 0.635 -0.132 -0.914 ...

In the above commands, n is the sample size, p is the variable dimension, q is the number of factors, and \(\rho\) controls the signal strength. We can refer to the help file using

?gendata

for more details.
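To make the model concrete, the following base-R sketch draws one data set directly from \(x_{ij}= \mu_j + h_i b_j^T + u_{ij}\) with a common noise variance. It does not reproduce the internals of gendata: the standard normal draws for the factors, loadings and intercepts, and the value of `sigma`, are assumptions chosen only for illustration.

```r
set.seed(1)
n <- 100; p <- 100; q <- 2
sigma <- 1                                    # assumed common noise s.d.

H  <- matrix(rnorm(n * q), n, q)              # row i is the factor vector h_i
B  <- matrix(rnorm(p * q), p, q)              # row j is the loading vector b_j
mu <- rnorm(p)                                # intercepts mu_j
U  <- matrix(rnorm(n * p, sd = sigma), n, p)  # homogeneous noise u_ij

# Entry (i, j) of X is mu_j + h_i b_j^T + u_ij.
X <- matrix(mu, n, p, byrow = TRUE) + H %*% t(B) + U
dim(X)
```

Writing the model in matrix form shows that the signal part `H %*% t(B)` has rank at most q, which is what a factor model exploits.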

Then we fit the GFM model with the following commands:

group <- rep(1,ncol(dat$X))
type <- 'gaussian'
# specify q=2
gfm1 <- gfm(dat$X, group, type, q=2, output = FALSE)

## Starting the alternate minimization algorithm...
## Finish the iterative algorithm...

str(gfm1)

## List of 6
##  $ hH     : num [1:100, 1:2] 0.198 -0.324 0.47 0.653 0.317 ...
##  $ hB     : num [1:100, 1:2] 0.1963 -0.169 0.1607 -0.4243 -0.0848 ...
##  $ hmu    : num [1:100] 0.259 0.684 0.638 -0.191 -0.877 ...
##  $ obj    : num 0.966
##  $ q      : num 2
##  $ history:List of 7
##   ..$ dB         : num [1:3] 1 0.01765 0.00541
##   ..$ dH         : num [1:3] 0.0558 0.01407 0.00493
##   ..$ dc         : num [1:3] 1.00 1.13e-04 6.29e-05
##   ..$ c          : num [1:3] 0.966 0.966 0.966
##   ..$ realIter   : num 3
##   ..$ maxIter    : num 50
##   ..$ elapsedTime: 'proc_time' Named num [1:5] 0.09 0.02 1.75 NA NA
##   .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
##  - attr(*, "class")= chr "gfm"

# select q automatically
gfm2 <- gfm(dat$X, group, type, q=NULL, q_set = 1:6, output = FALSE)

## The factor number q is estimated as  2 . 
## Starting the alternate minimization algorithm...
## Finish the iterative algorithm...

str(gfm2)

## List of 6
##  $ hH     : num [1:100, 1:2] 0.198 -0.324 0.47 0.653 0.317 ...
##  $ hB     : num [1:100, 1:2] 0.1963 -0.169 0.1607 -0.4243 -0.0848 ...
##  $ hmu    : num [1:100] 0.259 0.684 0.638 -0.191 -0.877 ...
##  $ obj    : num 0.966
##  $ q      : int 2
##  $ history:List of 7
##   ..$ dB         : num [1:3] 1 0.01765 0.00541
##   ..$ dH         : num [1:3] 0.0558 0.01407 0.00493
##   ..$ dc         : num [1:3] 1.00 1.13e-04 6.29e-05
##   ..$ c          : num [1:3] 0.966 0.966 0.966
##   ..$ realIter   : num 3
##   ..$ maxIter    : num 50
##   ..$ elapsedTime: 'proc_time' Named num [1:5] 0.12 0.1 2.14 NA NA
##   .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
##  - attr(*, "class")= chr "gfm"

# measure the performance of GFM estimators
measurefun(gfm2$hH, dat$H0, type='ccor')

## [1] 0.8977694

measurefun(gfm2$hB, dat$B0, type='ccor')

## [1] 0.9185544

In the above commands, we need to specify the type of each variable through the parameters group and type. We can also specify the number of factors to be extracted, or let it be selected automatically by the PC (IC) criteria.

Heterogeneous continuous variables

In this example, we generate data with heterogeneous normal variables from the following model \[x_{ij}= \mu_j + h_i b_j^T + u_{ij},\] where \(u_{ij} \sim N(0, \sigma_j^2)\). Such data can be generated by the function gendata:

n <- 100
p <- 100
q <- 2; rho <- 4
type <- 'heternorm'
dat <- gendata(seed=1, n=n, p=p, type= type, q=q, rho=rho)
str(dat)

## List of 4
##  $ X  : num [1:100, 1:100] 0.699 -2.001 1.645 0.152 3.115 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##  $ B0 : num [1:100, 1:2] 0.2655 -0.0778 0.3542 -0.6761 -0.1397 ...
##  $ H0 : num [1:100, 1:2] -0.148658 -0.000162 -0.098823 1.02493 0.628709 ...
##  $ mu0: num [1:100] 0.164 0.676 0.635 -0.132 -0.914 ...
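The only difference from the homogeneous case is that the noise standard deviation \(\sigma_j\) now depends on the column j. The base-R sketch below shows one way to generate such noise. The uniform range used for `sigma_j` is a hypothetical choice and is not taken from gendata.

```r
set.seed(1)
n <- 100; p <- 100

# Hypothetical column-specific noise standard deviations sigma_j.
sigma_j <- runif(p, min = 0.5, max = 2)

# Scale column j of a standard normal matrix by sigma_j, so u_ij ~ N(0, sigma_j^2).
Z <- matrix(rnorm(n * p), n, p)
U <- sweep(Z, 2, sigma_j, FUN = "*")

# The empirical column standard deviations should track sigma_j.
emp_sd <- apply(U, 2, sd)
cor(emp_sd, sigma_j)
```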

group <- rep(1,ncol(dat$X))
type <- 'gaussian'
gfm3 <- gfm(dat$X, group, type, q=NULL, q_set = 1:4, output = FALSE)

## The factor number q is estimated as  2 . 
## Starting the alternate minimization algorithm...
## Finish the iterative algorithm...

plot(gfm3$history$dc, type='o')

We compare the performance with that of the linear factor model using the functions measurefun and Factorm, with canonical correlation as the measure. The larger its value, the better.

measurefun(gfm3$hH, dat$H0, type='ccor')

## [1] 0.9560164

measurefun(gfm3$hB, dat$B0, type='ccor')

## [1] 0.9274218

Fac <- Factorm(dat$X)
measurefun(Fac$hH, dat$H0, type='ccor')

## [1] 0.8904272

measurefun(Fac$hB, dat$B0, type='ccor')

## [1] 0.8970445

The above results show that GFM can produce better estimators than the linear factor model by using the information in the heterogeneous variances.
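For readers who want to see what a canonical-correlation comparison looks like without the package, here is a base-R sketch. It estimates the factors of a linear factor model by principal components (via `svd`) and summarises the agreement with the true factors by the smallest canonical correlation from `stats::cancor`. Whether Factorm and measurefun compute exactly these quantities is an assumption; the simulated data and the names `H0`, `B0`, `hH`, `hB` are used here only for illustration.

```r
set.seed(1)
n <- 100; p <- 100; q <- 2

# Simulated truth (illustrative, not produced by gendata).
H0 <- matrix(rnorm(n * q), n, q)
B0 <- matrix(rnorm(p * q), p, q)
X  <- H0 %*% t(B0) + matrix(rnorm(n * p), n, p)

# Principal-component estimate: leading q singular vectors of the centred data.
Xc <- scale(X, center = TRUE, scale = FALSE)
sv <- svd(Xc, nu = q, nv = q)
hH <- sqrt(n) * sv$u                              # estimated factors
hB <- sv$v %*% diag(sv$d[1:q], q) / sqrt(n)       # estimated loadings

# Canonical correlations between the estimated and true column spaces.
cc_H <- cancor(hH, H0)$cor
cc_B <- cancor(hB, B0)$cor
min(cc_H)
min(cc_B)
```

Canonical correlations are used because factors and loadings are identified only up to an invertible transformation, so comparing column spaces is more meaningful than comparing entries.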