Takeru Elysia news!¶

大規模GWASやPRS研究を可能にするHail(2)¶

Takeru ElysiaはApache Sparkが利用可能な大規模計算プラットフォームです。従来のジョブスケジューラーを用いたものと異なり、複数台の計算機を使う同じcluster構成でも、次のような特徴を持っています。

自分でデータ分割をする必要がない
並列処理を自動実行してくれる
計算にnfsを使用しないため、ファイルI/Oがボトルネックになりにくい
高い拡張性を持っており、数ノードからスタートして数百ノード以上にスケールすることができる

動作するソフトウェアやライブラリの中にはGATKといったメジャーなものもありますが今回は大規模なGWASやPRS(Polygenic Risk Score)を用いた研究ができる、Hailについてお話しします。近年、各種バイオバンクが充実してきており、数年まではできなかったことが可能になってきているようです。Apache Spark上で動作するHailなら、いくつものバイオバンクのデータを読み込み、自在にフィルターや変換、アノテーションをしていくことで新しい知見をより早く得ることができます。

今日の内容: Hail上にplink形式、任意のテキスト形式、bgen形式のファイルを読み込み、データ構造や内容を閲覧してみます。¶

使用するファイルについて: plink形式と任意のテキストファイルは、PRSのTutorial、「Basic Tutorial for Polygenic Risk Score Analysis」にあるものを使いました。 https://choishingwan.github.io/PRS-Tutorial/ https://github.com/choishingwan/PRS-Tutorial/raw/master/resources/EUR.zip をダウンロードし、展開、次の3つのファイルを読み込みます。 EUR.bed,EUR.bim,EUR.fam EUR.bedはバイナリで、pedファイルに変換して中を見てみると次のようになっています。とても横に長いファイルです。

0 HG00096 0 0 1 -9 GA AG AT GA TA ....

EUR.bimは次のような内容が入っています。

$ head EUR.bim 
1   rs3094315   0   752566  G   A
1   rs2905035   0   775659  A   G
1   rs2980319   0   777122  A   T
1   rs4040617   0   779322  G   A
1   rs2977612   0   780785  T   A
1   rs2905062   0   785050  G   A
1   rs2980300   0   785989  T   C
1   rs3121561   0   990380  T   C
1   rs3813193   0   998501  C   G
1   rs4075116   0   1003629 C   T

EUR.famは次のような内容が入っています。

$ head EUR.fam 
0   HG00096 0   0   1   -9
0   HG00097 0   0   2   -9
0   HG00099 0   0   2   -9
0   HG00100 0   0   2   -9
0   HG00101 0   0   1   -9
0   HG00102 0   0   2   -9
0   HG00103 0   0   1   -9
0   HG00105 0   0   1   -9
0   HG00106 0   0   2   -9
0   HG00107 0   0   1   -9

また、任意のテキストファイルとして同じTutorialで使用している、EUR.heightを使います。

$ head EUR.height 
FID IID Height
0 HG00096 167.18
0 HG00097 169.85
0 HG00099 170.6
0 HG00100 166.78
0 HG00101 165.03
0 HG00102 170.2
0 HG00103 163.2
0 HG00105 164.92
0 HG00106 174.18

bgenファイルはUK Biobankで配布される形式ですが、次の場所に小さいサンプルデータがありますのでそれを使っています。 https://bitbucket.org/gavinband/bgen/src/release/example/ HailはPython上で動作します。まず、notebook上でHailを動作させます。この試験ではkubernetes上にspark clusterをデプロイしています。

plinkファイルの読み込み¶

In [2]:

plink_data = hl.import_plink(
    bed="data/EUR.bed", bim="data/EUR.bim", fam="data/EUR.fam",
    quant_pheno=True, missing='-9')

2020-06-04 13:23:45 Hail: INFO: Found 503 samples in fam file.
2020-06-04 13:23:45 Hail: INFO: Found 2395907 variants in bim file.

データの確認¶

hl.import_plinkはplink形式のbed,bim,famファイルをmatrixtableとして読み込みます。

In [3]:

plink_data

Out[3]:

<hail.matrixtable.MatrixTable at 0x7fd6ea408d90>

データの内容確認¶

まず、データの中身を少し見てみましょう。 hailのmatrixtableは次の4つのfieldsからできています。

column field
row field
entry field
global fieldです。

まず、データの構造を確認します。

describe¶

In [4]:

plink_data.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'fam_id': str
    'pat_id': str
    'mat_id': str
    'is_female': bool
    'quant_pheno': float64
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'cm_position': float64
----------------------------------------
Entry fields:
    'GT': call
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

show¶

データの中身を見てみます

In [5]:

plink_data.show()

2020-06-04 13:23:57 Hail: INFO: Ordering unsorted dataset with network shuffle
2020-06-04 13:24:05 Hail: INFO: Ordering unsorted dataset with network shuffle

locus	alleles	HG00096.GT	HG00097.GT	HG00099.GT	HG00100.GT
locus<GRCh37>	array<str>	call	call	call	call
1:752566	["A","G"]	0/1	0/0	0/1	0/0
1:775659	["G","A"]	0/1	0/0	0/1	0/0
1:777122	["T","A"]	0/1	0/0	0/1	0/0
1:779322	["A","G"]	0/1	0/0	0/1	0/0
1:780785	["A","T"]	0/1	0/0	0/1	0/0
1:785050	["A","G"]	0/1	0/0	0/1	0/0
1:785989	["C","T"]	0/1	0/0	0/1	0/0
1:990380	["C","T"]	0/1	0/1	0/0	0/1
1:998501	["G","C"]	0/0	0/1	0/0	0/1
1:1003629	["T","C"]	1/1	0/0	0/0	0/0

showing top 10 rows

showing the first 4 of 503 columns

In [6]:

plink_data.entries().show()

2020-06-04 13:24:57 Hail: WARN: entries(): Resulting entries table is sorted by '(row_key, col_key)'.
    To preserve row-major matrix table order, first unkey columns with 'key_cols_by()'
2020-06-04 13:25:05 Hail: INFO: Ordering unsorted dataset with network shuffle

locus	alleles	rsid	cm_position	s	fam_id	pat_id	mat_id	is_female	quant_pheno	GT
locus<GRCh37>	array<str>	str	float64	str	str	str	str	bool	float64	call
1:752566	["A","G"]	"rs3094315"	0.00e+00	"HG00096"	NA	NA	NA	false	NA	0/1
1:752566	["A","G"]	"rs3094315"	0.00e+00	"HG00097"	NA	NA	NA	true	NA	0/0
1:752566	["A","G"]	"rs3094315"	0.00e+00	"HG00099"	NA	NA	NA	true	NA	0/1
1:752566	["A","G"]	"rs3094315"	0.00e+00	"HG00100"	NA	NA	NA	true	NA	0/0
1:752566	["A","G"]	"rs3094315"	0.00e+00	"HG00101"	NA	NA	NA	false	NA	0/0
1:752566	["A","G"]	"rs3094315"	0.00e+00	"HG00102"	NA	NA	NA	true	NA	0/0
1:752566	["A","G"]	"rs3094315"	0.00e+00	"HG00103"	NA	NA	NA	false	NA	NA
1:752566	["A","G"]	"rs3094315"	0.00e+00	"HG00105"	NA	NA	NA	false	NA	0/0
1:752566	["A","G"]	"rs3094315"	0.00e+00	"HG00106"	NA	NA	NA	true	NA	0/0
1:752566	["A","G"]	"rs3094315"	0.00e+00	"HG00107"	NA	NA	NA	false	NA	0/1

showing top 10 rows

In [7]:

plink_data.cols().show()

2020-06-04 13:25:52 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'

s	fam_id	pat_id	mat_id	is_female	quant_pheno
str	str	str	str	bool	float64
"HG00096"	NA	NA	NA	false	NA
"HG00097"	NA	NA	NA	true	NA
"HG00099"	NA	NA	NA	true	NA
"HG00100"	NA	NA	NA	true	NA
"HG00101"	NA	NA	NA	false	NA
"HG00102"	NA	NA	NA	true	NA
"HG00103"	NA	NA	NA	false	NA
"HG00105"	NA	NA	NA	false	NA
"HG00106"	NA	NA	NA	true	NA
"HG00107"	NA	NA	NA	false	NA

showing top 10 rows

In [8]:

plink_data.rows().show()

2020-06-04 13:26:02 Hail: INFO: Ordering unsorted dataset with network shuffle

locus	alleles	rsid	cm_position
locus<GRCh37>	array<str>	str	float64
1:752566	["A","G"]	"rs3094315"	0.00e+00
1:775659	["G","A"]	"rs2905035"	0.00e+00
1:777122	["T","A"]	"rs2980319"	0.00e+00
1:779322	["A","G"]	"rs4040617"	0.00e+00
1:780785	["A","T"]	"rs2977612"	0.00e+00
1:785050	["A","G"]	"rs2905062"	0.00e+00
1:785989	["C","T"]	"rs2980300"	0.00e+00
1:990380	["C","T"]	"rs3121561"	0.00e+00
1:998501	["G","C"]	"rs3813193"	0.00e+00
1:1003629	["T","C"]	"rs4075116"	0.00e+00

showing top 10 rows

In [9]:

plink_data.globals_table().show()

テキストファイルの読み込み¶

hailではタブ区切りやカンマ区切り、スペース区切りといった任意のテキストファイルを読み込むことができます。任意の大規模な表現型のデータを簡単に扱うことができます。

delimiter='\s+' を指定することでスペース区切りのファイルを読むことができます。

In [10]:

height = hl.import_table('data/EUR.height', 
                      impute=True,
                      delimiter='\s+',
                      key='IID')

2020-06-04 13:26:10 Hail: INFO: Reading table to impute column types
2020-06-04 13:26:10 Hail: INFO: Finished type imputation
  Loading column 'FID' as type 'int32' (imputed)
  Loading column 'IID' as type 'str' (imputed)
  Loading column 'Height' as type 'float64' (imputed)

In [11]:

height.show()

FID	IID	Height
int32	str	float64
0	"HG00096"	1.67e+02
0	"HG00097"	1.70e+02
0	"HG00099"	1.71e+02
0	"HG00100"	1.67e+02
0	"HG00101"	1.65e+02
0	"HG00102"	1.70e+02
0	"HG00103"	1.63e+02
0	"HG00105"	1.65e+02
0	"HG00106"	1.74e+02
0	"HG00107"	1.59e+02

showing top 10 rows

hail table¶

ここで読み込んだデータは、matrixtableではなく、tableと呼ばれる形式で扱われます。 hailには、matrixtableとtableがあります。

In [12]:

height

Out[12]:

<hail.table.Table at 0x7fd6e9157bd0>

tableでもfilterなどの操作が可能で、例えば次のようにすると身長が180cm以上のデータに絞り込むことができます。

In [13]:

height.filter(height.Height > 180 ).show()

FID	IID	Height
int32	str	float64
0	"HG00154"	1.82e+02
0	"NA12890"	1.80e+02

bgen形式のファイルを読み込み¶

In [14]:

hl.index_bgen("data/example.8bits.bgen",
               contig_recoding={"01": "1"},
               reference_genome='GRCh37')

2020-06-04 13:26:17 Hail: INFO: Number of BGEN files indexed: 1

In [15]:

bgen = hl.import_bgen("data/example.8bits.bgen",
                      entry_fields=['GT', 'GP'], 
                      sample_file="data/example.sample")

2020-06-04 13:26:21 Hail: INFO: Number of BGEN files parsed: 1
2020-06-04 13:26:21 Hail: INFO: Number of samples in BGEN files: 500
2020-06-04 13:26:21 Hail: INFO: Number of variants across all BGEN files: 199

bgen形式はhailではmatrixtableとして扱われます。 vcf形式やplink形式と同様に、describeやshowを使用して中身を見てみます。

In [16]:

bgen

Out[16]:

<hail.matrixtable.MatrixTable at 0x7fd6e90eead0>

In [17]:

bgen.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'varid': str
----------------------------------------
Entry fields:
    'GT': call
    'GP': array<float64>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

In [18]:

bgen.show()

locus	alleles	sample_001.GT	sample_001.GP
locus<GRCh37>	array<str>	call	array<float64>
1:1001	["A","G"]	1/1	[0.00e+00,3.92e-03,9.96e-01]
1:2000	["A","G"]	NA	NA
1:2001	["A","G"]	0/0	[9.18e-01,7.84e-03,7.45e-02]
1:3000	["A","G"]	1/1	[3.92e-03,3.92e-03,9.92e-01]
1:3001	["A","G"]	0/0	[9.92e-01,3.92e-03,3.92e-03]
1:4000	["A","G"]	0/0	[9.92e-01,0.00e+00,7.84e-03]
1:4001	["A","G"]	1/1	[7.84e-03,0.00e+00,9.92e-01]
1:5000	["A","G"]	1/1	[3.92e-03,2.75e-02,9.69e-01]
1:5001	["A","G"]	0/0	[9.69e-01,2.75e-02,3.92e-03]
1:6000	["A","G"]	0/1	[3.92e-03,9.96e-01,0.00e+00]

showing top 10 rows

showing the first 1 of 500 columns

In [19]:

bgen.col.show()

s
str
"sample_001"
"sample_002"
"sample_003"
"sample_004"
"sample_005"
"sample_006"
"sample_007"
"sample_008"
"sample_009"
"sample_010"

showing top 10 rows

In [20]:

bgen.row.show()

locus	alleles	rsid	varid
locus<GRCh37>	array<str>	str	str
1:1001	["A","G"]	"RSID_101"	"SNPID_101"
1:2000	["A","G"]	"RSID_2"	"SNPID_2"
1:2001	["A","G"]	"RSID_102"	"SNPID_102"
1:3000	["A","G"]	"RSID_3"	"SNPID_3"
1:3001	["A","G"]	"RSID_103"	"SNPID_103"
1:4000	["A","G"]	"RSID_4"	"SNPID_4"
1:4001	["A","G"]	"RSID_104"	"SNPID_104"
1:5000	["A","G"]	"RSID_5"	"SNPID_5"
1:5001	["A","G"]	"RSID_105"	"SNPID_105"
1:6000	["A","G"]	"RSID_6"	"SNPID_6"

showing top 10 rows

In [21]:

bgen.entry.show()

locus	alleles	sample_001.GT	sample_001.GP
locus<GRCh37>	array<str>	call	array<float64>
1:1001	["A","G"]	1/1	[0.00e+00,3.92e-03,9.96e-01]
1:2000	["A","G"]	NA	NA
1:2001	["A","G"]	0/0	[9.18e-01,7.84e-03,7.45e-02]
1:3000	["A","G"]	1/1	[3.92e-03,3.92e-03,9.92e-01]
1:3001	["A","G"]	0/0	[9.92e-01,3.92e-03,3.92e-03]
1:4000	["A","G"]	0/0	[9.92e-01,0.00e+00,7.84e-03]
1:4001	["A","G"]	1/1	[7.84e-03,0.00e+00,9.92e-01]
1:5000	["A","G"]	1/1	[3.92e-03,2.75e-02,9.69e-01]
1:5001	["A","G"]	0/0	[9.69e-01,2.75e-02,3.92e-03]
1:6000	["A","G"]	0/1	[3.92e-03,9.96e-01,0.00e+00]

showing top 10 rows

showing the first 1 of 500 columns

今回は次のような操作を行いました。

plink形式(bed,bim,fam)を読み込み
任意のテキストファイルを読み込み
bgen形式を読み込み
各ファイルの閲覧

次回の記事では今回読み込んだplink形式のデータに、任意のテキストファイルとして読み込んだ身長のデータを結合してみようと思います。

In [ ]:

Nabe International | Takeru Boost >> Takeruを回せ！ >>

大規模GWASやPRS研究を可能にするHail(2)

Takeru Elysia news!¶

大規模GWASやPRS研究を可能にするHail(2)¶

今日の内容: Hail上にplink形式、任意のテキスト形式、bgen形式のファイルを読み込み、データ構造や内容を閲覧してみます。¶

plinkファイルの読み込み¶

データの確認¶

データの内容確認¶

describe¶

show¶

テキストファイルの読み込み¶

hail table¶

bgen形式のファイルを読み込み¶

前の記事

次の記事