Building queries with the Python SDK

In the following notebook, we will show how to build complex queries in GOR using the Python SDK to connect to our instance. First, as always, we load the gor magic extension to be able to use the %gor and %%gor syntax. This notebook assumes you are familiar with the gor syntax.

[1]:

# Install the python sdk, if needed
! pip install nextcode-sdk[jupyter] -U > /dev/null

[2]:

# load the magic extension and imports
%reload_ext nextcode
import nextcode
import pandas as pd
%env LOG_QUERY=1

 Gor magic extension has been loaded. You can now use '%gor' and '%%gor' in this notebook
 * Python SDK Version: 0.8.1
 * Query API Version: 1.10.0
 * GOR Version: 2.6.0 (git SHA 622555a)
 * Root Endpoint: https://platform.wuxinextcodedev.com/api/query/
 * Current User: jonb@wuxinextcode.com
 * Current Project: ukbb_hg38 (%env GOR_API_PROJECT=xxx)
env: LOG_QUERY=1

Simple single-line gor expression

[3]:

%gor nor #dbsnp# | top 2

Query 19503 generated 2 rows in 0.84 sec

[3]:

	Chrom	pos	reference	allele	rsids
0	chr1	10020	AA	A	rs775809821
1	chr1	10039	A	C	rs978760828

Single-line gor expression which stores results in local variable

[4]:

var = %gor nor #dbsnp# | top 2

Query 19504 generated 2 rows in 0.73 sec

[5]:

print(var)

  Chrom    pos reference allele        rsids
0  chr1  10020        AA      A  rs775809821
1  chr1  10039         A      C  rs978760828

Simple multi-line gor expression

Multi-line expressions are started with %%. Each statement should end with ;

[6]:

%%gor
  def #top# = 2;
  gor #dbsnp#
   | top #top#;

Query 19505 generated 2 rows in 0.77 sec

[6]:

	Chrom	pos	reference	allele	rsids
0	chr1	10020	AA	A	rs775809821
1	chr1	10039	A	C	rs978760828

Multi-line gor expression which stores results in local variable

To load results from a multi-line gor expression into a variable you can use the special << operator

[7]:

%%gor new_results <<
def #tops# = 10;
gor #dbsnp# | top #tops#;

Query 19506 generated 10 rows in 0.86 sec

[8]:

new_results.index

[8]:

RangeIndex(start=0, stop=10, step=1)

Store results in the user_data folder

Instead of fetching the data to the jupyter server you can also use the | write gor command to store files in the user_data folder within the project on NFS.

[9]:

%%gor
  pgor #dbsnp#
   | top 1000
   | write user_data/outfile.gorz;

Query 19507 generated 0 rows in 0.86 sec

[9]:

[10]:

# verify that the file is there by executing gorls
%gorls user_data outfile.gorz

outfile.gorz (10.2KiB)

[11]:

# now you can gor this file and bring the results into a local dataframe
%gor user_data/outfile.gorz | GROUP chrom -count | top 2

Query 19509 generated 1 rows in 0.73 sec

[11]:

	Chrom	bpStart	bpStop	allCount
0	chr7	0	159345973	1000

Download results to a local file

When the results are very large and you want to be able to use them locally without rerunning the query you can stream the results directly to a file with the << operator by specifying file:. You can cancel the download at any time and have partial results in the output file.

[ ]:

filename = "dbsnp_2m.tsv"

[ ]:

%%gor file:$filename <<
    gor #dbsnp# | top 2000000

[ ]:

!wc -l $filename
!ls -lh $filename

[ ]:

[ ]:

Gor expression which references a local variable

Local variables can be references using the $xxx syntax. This will expand the variable inline and is meant to be used for simple strings and numbers

[ ]:

num = 5

[ ]:

%%gor
  gor -p chr1:69000-70000 #dbsnp#
   | TOP $num;

Gor expression with defs and create statements

[ ]:

%%gor
def #number# = $num;
def #VEP_single# = UKBB/freezes/2019_0325/plink_wes/vep_single.gorz;
def #myfavouritefood# = pizza;

create mydbsnp = gor -p chr1:69000-70000 #dbsnp#
                  | TOP #number#;

gor [mydbsnp];

Compound expressions

Using the local variables you can construct compount expressions, allowing you to split up long gor scripts.

[ ]:

defs = """
def #myfirstdef# = 2;
def #myseconddef# = 2;

"""
creates = """
create mydbsnp = gor -p chr1:69000-70000 #dbsnp#
                  | TOP #myfirstdef#;
"""

[ ]:

%%gor
 $defs
 $creates

 gor [mydbsnp];

Virtual relations

You can load pandas dataframes into the query with the special [var:xxx] syntax.

[ ]:

allpns = pd.read_csv('pnlist.tsv', delimiter='\t')

[ ]:

%%gor
  create bla = gor #dbsnp#
   | top 10;
  nor [var:allpns]
   | top 5;

Cancelling queries

Try executing this long-running query and then click the stop button

[ ]:

results = %gor pgor #dbsnp# | GROUP chrom -count

Using the SDK without magic syntax

The % and %% is simply syntactic sugar for executing queries through the python sdk. You can use the sdk directly as well and that is probably the preferred method for serious analysis.

Initializing the SDK

The SDK usually needs to be initialized with an API key and a project. However, in Wuxi notebooks this has already been done for you.

[ ]:

#%env GOR_API_KEY=[jwt token]
#%env GOR_API_PROJECT=[project name]
import nextcode
import pandas
# get a handle on a query service proxy
svc = nextcode.get_service("query")

Running basic queries

[ ]:

qry = svc.execute("gor #dbsnp# | top 2;")
print(f"Query {qry.query_id} is {qry.status}. Runtime was {qry.duration} ms and it produced {qry.line_count} rows.")
print(f"The query that was run: {qry.query}")
from pprint import pprint
df = qry.dataframe()
df

Analysing failed queries

[ ]:

qry = svc.execute("gor #unknowndef# | top 2;")
if qry.status == "FAILED":
    import sys
    print(f"Query failed with error: {qry.error['message']}\n", file=sys.stderr)
    print(f"Additional information available from qry.error[]: {list(qry.error.keys())}", file=sys.stderr)

Using virtual relations

[ ]:

import pandas as pd
import nextcode
svc = nextcode.get_service("query")

# use a dataframe as a virtual relation
df = pd.DataFrame(columns=["hello", "world"], data=[[1, 2]])
qry = svc.execute("gor [test]", test=df)
print(qry.dataframe())

# use a tsv string as a virtual relation
tsv = """hello\tworld\n1\t2"""
qry = svc.execute("gor [test]", test=tsv)
print(qry.dataframe())

# if your virtual relation uses special characters (:) in variable name
qry = svc.execute("gor [var:test]", relations=[{"name": "var:test", "data": df}])
print(qry.dataframe())