vignettes/gor-create.Rmd
In this example, we show how `gor_create` can be used to prepare and construct a query closure. This both reduces repetition in code and simplifies iterative workflows in GOR.
First, load the `gorr` package. The `tidyverse` package is recommended, but for the sake of simplicity we load only the packages we use:
library(gorr)
library(magrittr) # pipe
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Next, we make a `conn` object that holds information on the API we are connecting to. `gor_connect` takes two parameters, `api_key` and `project`; if either is left out, it tries to read the environment variable `GOR_API_KEY` or `GOR_API_PROJECT`, respectively. Here the `GOR_API_KEY` environment variable is already defined, so supplying only a target project suffices. (The warning in the output notes that `gor_connect` is deprecated in favour of `platform_connect`.) We then create a `query` function (a closure) so that we don't have to reference `conn` again:
conn <- gor_connect(project = "ukbb_hg38")
#> Warning: 'gor_connect' is deprecated.
#> Use 'platform_connect' instead.
#> See help("Deprecated")
query <- gor_create(conn = conn)
query
#> ── GOR Creation Query ──────────────────────────────────────────────────────────
#> Connection
#> Service Root: https://platform.wuxinextcodedev.com/api/query
#> Project: ukbb_hg38
#> Definitions
#> None
#> Create statements & virtual relations
#> None
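The environment-variable fallback described for `gor_connect` can be pictured with a small base R sketch. The helper `resolve_setting` and the variable name `DEMO_API_KEY` are hypothetical and chosen for illustration; this is not how `gorr` is necessarily implemented.

```r
# Hypothetical helper, not part of gorr: return `value` if it was supplied,
# otherwise fall back to the named environment variable.
resolve_setting <- function(value = NULL, env_var) {
  if (!is.null(value) && nzchar(value)) {
    return(value)
  }
  from_env <- Sys.getenv(env_var, unset = "")
  if (!nzchar(from_env)) {
    stop("Supply a value or set the environment variable ", env_var)
  }
  from_env
}

# Placeholder variable and value, used only for this demonstration.
Sys.setenv(DEMO_API_KEY = "dummy-key")

api_key <- resolve_setting(env_var = "DEMO_API_KEY")        # read from the environment
project <- resolve_setting("ukbb_hg38", "DEMO_API_PROJECT") # explicit argument wins
```

An explicit argument takes precedence, and the environment is consulted only when the argument is missing, which mirrors the behaviour described in the text.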
Now we can call that function with only the query parameter. Let's search for genes matching BRCA and save the resulting table as a local data frame, `mygenes`:
mygenes <- query("gor #genes# | grep BRCA")
mygenes
#> # A tibble: 3 × 4
#> chrom gene_start gene_end gene_symbol
#> <chr> <int> <int> <chr>
#> 1 chr13 32315085 32400268 BRCA2
#> 2 chr17 43044294 43170245 BRCA1
#> 3 chr17 43168169 43168249 BRCA1P1
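The reason `query` no longer needs `conn` is that a closure remembers the variables of the function that created it. A minimal sketch of that idea follows; `query_factory` and `conn_stub` are hypothetical stand-ins, and no server is contacted.

```r
# Hypothetical factory, not gorr's implementation: it captures `conn`
# and returns a function that only needs the query text.
query_factory <- function(conn) {
  force(conn)  # evaluate now so the closure keeps this exact connection
  function(query) {
    # A real client would send a request here; this sketch only returns
    # what would be sent.
    list(project = conn$project, query = query)
  }
}

conn_stub <- list(project = "ukbb_hg38")  # stand-in for a connection object
run <- query_factory(conn_stub)
request <- run("gor #genes# | grep BRCA")
```

Each call to `run` reuses the captured connection, so only the query text changes between calls.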
Next, we can extend the previously defined `query` function by passing it back into `gor_create` as the `replace` parameter. This time we include some definitions using the `defs` parameter, and we alias our local data frame so that we can reference it in remote queries. In GOR, such aliases are called virtual relations:
query <- gor_create(
  defs = "def variants = #dbsnp#",
  mygenes = mygenes,
  replace = query
)
query
#> ── GOR Creation Query ──────────────────────────────────────────────────────────
#> Connection
#> Service Root: https://platform.wuxinextcodedev.com/api/query
#> Project: ukbb_hg38
#> Definitions
#> def variants = #dbsnp#;
#> Create statements & virtual relations
#> mygenes
#> # A tibble: 3 × 4
#> chrom gene_start gene_end gene_symbol
#> <chr> <int> <int> <chr>
#> 1 chr13 32315085 32400268 BRCA2
#> 2 chr17 43044294 43170245 BRCA1
#> 3 chr17 43168169 43168249 BRCA1P1
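One way to picture how a `replace` argument could carry settings forward is a factory that reads the state captured by the previous closure and merges new definitions and relations into it. The sketch below is an assumption made for illustration; `extendable_factory` is a hypothetical name and nothing here reflects `gorr` internals.

```r
# Hypothetical factory, not gorr's implementation.
extendable_factory <- function(conn = NULL, defs = NULL, ..., replace = NULL) {
  relations <- list(...)  # named data frames act as virtual relations
  if (!is.null(replace)) {
    # Reuse the settings captured by the previous closure.
    previous <- environment(replace)
    if (is.null(conn)) conn <- previous$conn
    defs <- c(previous$defs, defs)
    relations <- c(previous$relations, relations)
  }
  force(conn)
  force(defs)
  function(query) {
    # Return what would be sent instead of contacting a server.
    list(conn = conn, defs = defs, relations = names(relations), query = query)
  }
}

q1 <- extendable_factory(conn = list(project = "ukbb_hg38"))
q2 <- extendable_factory(
  defs = "def variants = #dbsnp#",
  mygenes = data.frame(gene_symbol = "BRCA2"),
  replace = q1
)
request <- q2("gor [mygenes] | join -segvar variants")
```

Here `q2` inherits the connection from `q1` and adds one definition and one relation, which matches the pattern shown in the printed object above.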
Now that we have our updated `query` function, we can use it to `gor` our table of genes and join it to the `#dbsnp#` table, which we aliased as `variants` in the definitions above. The result is a list of all variants within each gene in our table:
brca_variants <- query("
gor [mygenes] | join -segvar variants
")
brca_variants %>%
  group_by(gene_symbol) %>%
  summarize(records = n(),
            variants = n_distinct(rsids))
#> # A tibble: 3 × 3
#> gene_symbol records variants
#> <chr> <int> <int>
#> 1 BRCA1 60548 49772
#> 2 BRCA1P1 34 31
#> 3 BRCA2 42346 35724
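To make the join and the two counts concrete, here is a small local sketch in base R on made-up data. It pairs each gene with the variants whose position falls inside the gene interval, then counts rows and distinct rsIDs per gene. The tables and values are invented, and the inclusive interval check is a simplification that does not model GOR's own coordinate conventions.

```r
# Toy stand-ins for the gene table and the variant table (made-up values).
genes <- data.frame(
  chrom       = c("chr1", "chr1"),
  gene_start  = c(100L, 500L),
  gene_end    = c(200L, 600L),
  gene_symbol = c("GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)
variants <- data.frame(
  chrom = c("chr1", "chr1", "chr1", "chr1", "chr1"),
  pos   = c(150L, 151L, 160L, 550L, 900L),
  rsids = c("rs1", "rs1", "rs2", "rs3", "rs4"),
  stringsAsFactors = FALSE
)

# Pair every gene with every variant on the same chromosome, then keep
# only the variants whose position lies inside the gene's interval.
pairs <- merge(genes, variants, by = "chrom")
overlaps <- pairs[pairs$pos >= pairs$gene_start & pairs$pos <= pairs$gene_end, ]

# Count rows and distinct rsIDs per gene.
records  <- tapply(overlaps$rsids, overlaps$gene_symbol, length)
distinct <- tapply(overlaps$rsids, overlaps$gene_symbol,
                   function(x) length(unique(x)))
summary_tbl <- data.frame(
  gene_symbol = names(records),
  records     = as.integer(records),
  variants    = as.integer(distinct),
  stringsAsFactors = FALSE
)
```

In this toy data `rs1` occurs at two positions inside `GENE_A`, so that gene has more records than distinct rsIDs.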
The difference between the number of records and the number of variants above can be explained by looking into the data:
target_variant <-
  brca_variants %>%
  group_by(rsids) %>%
  count() %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  head(n = 1) %>%
  pull(rsids)
target_variant
#> [1] "rs397838402"
brca_variants %>% filter(rsids == target_variant) %>% select(-distance)
#> # A tibble: 48 × 8
#> chrom gene_start gene_end gene_symbol pos reference allele rsids
#> <chr> <int> <int> <chr> <int> <chr> <chr> <chr>
#> 1 chr13 32315085 32400268 BRCA2 32395970 TTTTTTTTTTTTTTTT… T rs39…
#> 2 chr13 32315085 32400268 BRCA2 32395971 TTTTTTTTTTTTTTTT… T rs39…
#> 3 chr13 32315085 32400268 BRCA2 32395972 TTTTTTTTTTTTTTTTT T rs39…
#> 4 chr13 32315085 32400268 BRCA2 32395973 TTTTTTTTTTTTTTTT T rs39…
#> 5 chr13 32315085 32400268 BRCA2 32395974 TTTTTTTTTTTTTTT T rs39…
#> 6 chr13 32315085 32400268 BRCA2 32395975 TTTTTTTTTTTTTT T rs39…
#> 7 chr13 32315085 32400268 BRCA2 32395976 TTTTTTTTTTTTT T rs39…
#> 8 chr13 32315085 32400268 BRCA2 32395977 TTTTTTTTTTTT T rs39…
#> 9 chr13 32315085 32400268 BRCA2 32395978 TTTTTTTTTTT T rs39…
#> 10 chr13 32315085 32400268 BRCA2 32395979 TTTTTTTTTT T rs39…
#> # … with 38 more rows
Here a single rsID, rs397838402, is attached to 48 records. The rows shown sit at consecutive positions in a run of T bases, each with a different reference allele, so the number of records exceeds the number of distinct rsIDs.
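The same "find the most frequent identifier, then inspect its rows" step can be tried locally without a server. The sketch below uses base R on invented data; the column names follow the output above, but the values are made up.

```r
# Made-up records: one rsID is reported at several positions.
toy <- data.frame(
  pos   = c(10L, 11L, 12L, 40L, 41L, 90L),
  rsids = c("rs1", "rs1", "rs1", "rs2", "rs2", "rs3"),
  stringsAsFactors = FALSE
)

# Tabulate how often each rsID occurs and take the most frequent one.
counts <- sort(table(toy$rsids), decreasing = TRUE)
most_frequent <- names(counts)[1]

# Inspect every record that carries that rsID.
hits <- toy[toy$rsids == most_frequent, ]
```

Looking at `hits` shows the same rsID repeated across positions, which is the pattern that inflates the record count relative to the distinct count.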