vignettes/persistance-of-results.Rmd
In this example, we demonstrate the result persistence feature of the
Query API. Query results can be saved to a file on the server, usually
under the user_data/ directory, instead of being loaded into memory on
the client. This is useful, for example, when you’re preparing
case-control files or covariates as input to regression workflows such
as PLINK.
First, load the gorr package. The tidyverse package is recommended, but
for the sake of simplicity we load only the packages we’re using:
library(gorr)
library(magrittr) # pipe
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Next, we create a conn object to hold information about the API we’re
connecting to. platform_connect takes two parameters, api_key and
project; if either is left out, it tries to read the environment
variables GOR_API_KEY and GOR_API_PROJECT, respectively.
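As a minimal sketch of the environment-variable fallback described above, the two variables can be set for the current R session with base R's Sys.setenv() and inspected with Sys.getenv(). The key value below is a placeholder, not a real key. In practice these variables are usually defined outside the script (for example in .Renviron), so that no key ends up in source control.

```r
# Placeholder values for illustration only; never hard-code a real API key.
# Note: running this overrides any value already set in the current session.
Sys.setenv(GOR_API_KEY = "<your-api-key>", GOR_API_PROJECT = "ukbb_hg38")

# Check that both variables are visible to the session (non-empty strings).
vars <- Sys.getenv(c("GOR_API_KEY", "GOR_API_PROJECT"))
all(nzchar(vars))
```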
Here the GOR_API_KEY environment variable is already defined, so
supplying the function with only a target project suffices. After this
we create a query function (a closure) so that we don’t have to
reference conn again:
conn <- platform_connect(project = "ukbb_hg38")
query <- gor_create(conn = conn)
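To illustrate the closure idea in general terms, here is a sketch in plain R. make_query() and the list-based toy_conn are hypothetical stand-ins chosen for this illustration; they do not reflect how gor_create() is implemented and they never contact a server.

```r
# Hypothetical function factory; NOT gorr's implementation.
make_query <- function(conn) {
  # The returned function keeps `conn` in its enclosing environment,
  # so callers only need to pass the query text.
  function(query_text) {
    paste0("[", conn$project, "] ", query_text)
  }
}

toy_conn  <- list(project = "ukbb_hg38")  # toy stand-in for a connection object
toy_query <- make_query(toy_conn)
toy_query("gor #dbsnp# | top 10")
```

The point of the pattern is that the connection is captured once, when the function is created, rather than passed on every call.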
Now we can call that function with our query, along with a persist
parameter for storing the results on the server. Let’s write the first
1000 SNPs in dbSNP to a file on the server:
query("gor #dbsnp# | top 1000", persist = "user_data/doc/dbsnp_1000.gorz")
#>
#> ────────────────────────────────────────────────────────────────────────────────
#> Warning: Persisting results not allowed using 'queryserver' in R-sdk.
#> Switching to 'queryservice'. Please add 'write' statement to the GOR query for
#> persisting results in project if you want to use 'queryserver'.
query("gor #dbsnp# | top 1000", persist = "user_data/doc/dbsnp_1000.tsv")
#>
#> ────────────────────────────────────────────────────────────────────────────────
#> Warning: Persisting results not allowed using 'queryserver' in R-sdk.
#> Switching to 'queryservice'. Please add 'write' statement to the GOR query for
#> persisting results in project if you want to use 'queryserver'.
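The warning suggests adding a write statement to the GOR query itself as an alternative to the persist parameter. A minimal sketch of composing such a query string is shown here. with_write() is a hypothetical helper, not part of gorr, and the exact write syntax is an assumption that should be checked against the GOR documentation before use.

```r
# Hypothetical helper (not part of gorr): append a GOR `write` step to a query.
with_write <- function(gor_query, path) {
  paste0(gor_query, " | write ", path)
}

q <- with_write("gor #dbsnp# | top 1000", "user_data/doc/dbsnp_1000.gorz")
print(q)
```

The resulting string could then be passed to the query closure in place of the persist argument.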
Note that the two query() calls with persist were made to demonstrate
two different file types. One creates a GORz file (compressed, in
genomic order), the other a plain tsv file. We can now read the file
contents directly:
query("gor user_data/doc/dbsnp_1000.gorz | top 10")
#> # A tibble: 10 × 5
#> chrom pos reference allele rsids
#> <chr> <int> <chr> <chr> <chr>
#> 1 chr1 10001 T A rs1570391677
#> 2 chr1 10001 T C rs1570391677
#> 3 chr1 10002 A C rs1570391692
#> 4 chr1 10003 A C rs1570391694
#> 5 chr1 10007 T C rs1639538116
#> 6 chr1 10007 T G rs1639538116
#> 7 chr1 10008 A C rs1570391698
#> 8 chr1 10008 A G rs1570391698
#> 9 chr1 10008 A T rs1570391698
#> 10 chr1 10009 A C rs1570391702
To read the tsv file, we use NOR, which is suitable for reading any
tab-separated data file and does not assume genomic order:
query("nor user_data/doc/dbsnp_1000.tsv | top 10")
#> # A tibble: 10 × 3
#> reference allele rsids
#> <chr> <chr> <chr>
#> 1 T A rs1570391677
#> 2 T C rs1570391677
#> 3 A C rs1570391692
#> 4 A C rs1570391694
#> 5 T C rs1639538116
#> 6 T G rs1639538116
#> 7 A C rs1570391698
#> 8 A G rs1570391698
#> 9 A T rs1570391698
#> 10 A C rs1570391702
It is sometimes useful to list the files in a given directory; we can
use nor to do that:
query("nor user_data/doc/") %>%
  transmute(Filename, Filesize = fs::as_fs_bytes(Filesize), Filetype)
#> # A tibble: 3 × 3
#> Filename Filesize Filetype
#> <chr> <fs::bytes> <chr>
#> 1 doc 25K ""
#> 2 dbsnp_1000.gorz 9.12K "gorz"
#> 3 dbsnp_1000.tsv 21.1K "tsv"
The table above also shows the size difference between the gorz file
and the tsv file.
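As a rough worked example, the ratio between the two sizes can be computed from the values displayed in the listing (9.12K for the gorz file and 21.1K for the tsv file). The numbers are copied from that output and apply only to this particular 1000-row extract; other data will compress differently.

```r
# Sizes as displayed in the directory listing (both in K).
gorz_size <- 9.12
tsv_size  <- 21.1

# How many times larger the tsv file is than the gorz file.
ratio <- tsv_size / gorz_size
round(ratio, 2)
```

For this extract, the tsv file is a little over twice the size of the gorz file.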