大規模GWASやPRS研究を可能にするHail(1)
Takeru Elysia news!¶
大規模GWASやPRS研究を可能にするHail(1)¶
Takeru ElysiaはApache Sparkが利用可能な大規模計算プラットフォームです。従来のジョブスケジューラーを用いたものと異なり、複数台の計算機を使う同じcluster構成でも、次のような特徴を持っています。- 自分でデータ分割をする必要がない
- 並列処理を自動実行してくれる
- 計算にnfsを使用しないため、ファイルI/Oがボトルネックになりにくい
- 高い拡張性を持っており、数ノードからスタートして数百ノード以上にスケールすることができる
今日の内容: Hail上にvcfファイルを読み込み、データ構造や内容を閲覧してみます。¶
使用するvcfファイルについて: 使用するvcfファイルは、1000genomesより取得したものです。一度展開してひとつにまとめ、'chr'を取り除き、bgzipしていて、圧縮した状態で合計1.3TB程度あります。$ ls /zfs-archive/1000genomes/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20190425_NYGC_GATK/20190425_NYGC_GATK/CCDG_13607_B01_GRM_WGS_2019-02-19_*vcf.gz -v > list
$ ./vcf-concat `cat list` |awk '{gsub(/^chr/,""); print}' | /usr/local/miniconda3/bin/bgzip > 1000genomes_GRCh38.vcf.bgz
In [1]:
import
os
os
.
environ
[
'PATH'
]
=
'/usr/java/default/bin:/usr/local/spark/bin:/usr/local/miniconda3/bin:/home/testuser1/.local/bin:/home/testuser1/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin'
os
.
environ
[
'HAIL_DIR'
]
=
'/usr/local/miniconda3/lib/python3.7/site-packages/hail'
os
.
environ
[
'JAVA_HOME'
]
=
'/usr/java/default'
os
.
environ
[
'HADOOP_HOME'
]
=
'/usr/local/hadoop'
os
.
environ
[
'HADOOP_CONF_DIR'
]
=
'/usr/local/hadoop/etc/hadoop'
os
.
environ
[
'SPARK_HOME'
]
=
'/usr/local/spark'
os
.
environ
[
'PYSPARK_SUBMIT_ARGS'
]
=
'--jars
\
/usr/local/miniconda3/lib/python3.7/site-packages/hail/hail-all-spark.jar
\
--conf spark.executor.cores=30
\
--conf spark.executor.instances=4
\
--conf spark.driver.memory=20G
\
--conf spark.executor.memory=300G
\
--conf spark.executor.extraLibraryPath=/usr/local/miniconda3/lib
\
--conf spark.driver.extraClassPath=/usr/local/miniconda3/lib/python3.7/site-packages/hail/hail-all-spark.jar
\
--conf spark.executor.extraClassPath=./hail-all-spark.jar
\
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
\
--conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator
\
--files /usr/local/miniconda3/lib/python3.7/site-packages/hail/hail-all-spark.jar
\
--conf spark.kubernetes.driverEnv.SPARK_USER=testuser1
\
--conf spark.kubernetes.driverEnv.HADOOP_USER_NAME=testuser1
\
--conf spark.executorEnv.HADOOP_USER_NAME=testuser1
\
--conf spark.executorEnv.SPARK_USER=testuser1
\
pyspark-shell '
os
.
environ
[
'HAIL_DIR'
]
=
'/usr/local/miniconda3/lib/python3.7/site-packages/hail'
from
pyspark
import
SparkContext
,
SparkConf
from
pyspark.sql
import
SparkSession
sparkConf
=
SparkConf
()
sparkConf
.
setMaster
(
"k8s://https://ooo.ooo.ooo.ooo:6443"
)
sparkConf
.
setAppName
(
'hailonk8s'
)
sparkConf
.
set
(
'spark.kubernetes.authenticate.driver.serviceAccountName'
,
'spark'
)
sparkConf
.
set
(
'spark.submit.deployMode'
,
'client'
)
sparkConf
.
set
(
"spark.kubernetes.container.image"
,
"ooo.ooo.ooo.ooo:5000/spark-h321/spark-py:2.4.5"
)
sparkConf
.
set
(
"spark.kubernetes.namespace"
,
"default"
)
sparkConf
.
set
(
'spark.driver.host'
,
'ooo.ooo.ooo.ooo'
)
os
.
environ
[
'PYSPARK_PYTHON'
]
=
'/usr/bin/python3'
os
.
environ
[
'PYSPARK_DRIVER_PYTHON'
]
=
'/usr/local/miniconda3/bin/python'
sc
=
SparkContext
(
conf
=
sparkConf
)
import
hail
as
hl
hl
.
init
(
sc
=
sc
,
tmp_dir
=
'hdfs://spk000:9000/tmp'
)
from
hail.plot
import
show
from
pprint
import
pprint
hl
.
plot
.
output_notebook
()
/usr/local/miniconda3/lib/python3.7/site-packages/hail/context.py:71: UserWarning: pip-installed Hail requires additional configuration options in Spark referring
to the path to the Hail Python module directory HAIL_DIR,
e.g. /path/to/python/site-packages/hail:
spark.jars=HAIL_DIR/hail-all-spark.jar
spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
spark.executor.extraClassPath=./hail-all-spark.jar
'pip-installed Hail requires additional configuration options in Spark referring\n'
Running on Apache Spark version 2.4.5
SparkUI available at http://ooo.ooo.ooo.ooo:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.34-914bd8a10ca2
LOGGING: writing to /home/testuser1/tutorial/hail-20200528-0859-0.2.34-914bd8a10ca2.log
Loading BokehJS ...
vcfファイルの読み込み¶
vcfファイルから読み込む場合は、hailで高速に扱うためにMatrixtableへ変換する必要があります。 次のように、hl.import_vcfを使用して実行することができます。 ファイルが大きいためこの処理には時間がかかりますが、1度行えば再利用が可能です。In [2]:
%%time
hl
.
import_vcf
(
'hdfs://spk000:9000/ectest/1000genomes_GRCh38.vcf.bgz'
,
reference_genome
=
'GRCh38'
)
\
.
write
(
'hdfs://spk000:9000/ectest/1kg-L.mt'
,
overwrite
=
True
)
2020-05-28 09:09:23 Hail: INFO: Coerced sorted dataset
CPU times: user 753 ms, sys: 575 ms, total: 1.33 s
Wall time: 1h 19min 58s
2020-05-28 10:19:58 Hail: INFO: wrote matrix table with 126659422 rows and 2504 columns in 9929 partitions to hdfs://spk000:9000/ectest/1kg-L.mt
変換したmatrixtableを読み込みます。
In [3]:
mt
=
hl
.
read_matrix_table
(
'hdfs://spk000:9000/ectest/1kg-L.mt'
)
データの確認¶
hdfs上のmatrixtable¶
出来上がったmatrixtableはhdfs上に次のようなファイルやディレクトリ群からなります。 これをファイル操作したり複数のファイル分割したりということはsparkがやってくれますので意識することはほとんどありません。In [4]:
!
hadoop fs -ls /ectest/1kg-L.mt
Found 9 items
-rw-r--r-- 1 testuser1 supergroup 147 2020-05-28 10:19 /ectest/1kg-L.mt/README.txt
-rw-r--r-- 1 testuser1 supergroup 0 2020-05-28 10:19 /ectest/1kg-L.mt/_SUCCESS
drwxr-xr-x - testuser1 supergroup 0 2020-05-28 10:19 /ectest/1kg-L.mt/cols
drwxr-xr-x - testuser1 supergroup 0 2020-05-28 10:19 /ectest/1kg-L.mt/entries
drwxr-xr-x - testuser1 supergroup 0 2020-05-28 10:19 /ectest/1kg-L.mt/globals
drwxr-xr-x - testuser1 supergroup 0 2020-05-28 10:19 /ectest/1kg-L.mt/index
-rw-r--r-- 1 testuser1 supergroup 22498 2020-05-28 10:19 /ectest/1kg-L.mt/metadata.json.gz
drwxr-xr-x - testuser1 supergroup 0 2020-05-28 10:19 /ectest/1kg-L.mt/references
drwxr-xr-x - testuser1 supergroup 0 2020-05-28 10:19 /ectest/1kg-L.mt/rows
データの内容確認¶
まず、データの中身を少し見てみましょう。 hailのmatrixtableは次の4つのfieldsからできています。- column field
- row field
- entry field
- global fieldです。
show¶
In [5]:
mt
.
show
()
locus | alleles |
locus<GRCh38> | array<str> |
chr1:10109 | ["AACCCT","A"] |
chr1:10126 | ["T","C"] |
chr1:10140 | ["ACCCTAAC","A"] |
chr1:10143 | ["CTAACCCCT","C","*"] |
chr1:10145 | ["AACCCCT","*","ACCCT"] |
chr1:10146 | ["ACC","AC","*"] |
chr1:10147 | ["C","*","CCCTAA"] |
chr1:10150 | ["CT","TT","*","C"] |
chr1:10152 | ["A","G"] |
chr1:10153 | ["A","C"] |
showing top 10 rows
showing the first 0 of 2504 columns
In [6]:
mt
.
entries
()
.
show
()
2020-05-29 09:33:04 Hail: WARN: entries(): Resulting entries table is sorted by '(row_key, col_key)'.
To preserve row-major matrix table order, first unkey columns with 'key_cols_by()'
locus | alleles | rsid | qual | filters | info.AC | info.AF | info.AN | info.BaseQRankSum | info.ClippingRankSum | info.DP | info.DS | info.END | info.ExcessHet | info.FS | info.HaplotypeScore | info.InbreedingCoeff | info.MLEAC | info.MLEAF | info.MQ | info.MQ0 | info.MQRankSum | info.NEGATIVE_TRAIN_SITE | info.POSITIVE_TRAIN_SITE | info.QD | info.RAW_MQ | info.ReadPosRankSum | info.SOR | info.VQSLOD | info.VariantType | info.culprit | s | AB | AD | DP | GQ | GT | MIN_DP | MQ0 | PGT | PID | PL | RGQ | SB |
locus<GRCh38> | array<str> | str | float64 | set<str> | array<int32> | array<float64> | int32 | float64 | float64 | int32 | bool | int32 | float64 | float64 | float64 | float64 | array<int32> | array<float64> | float64 | int32 | float64 | bool | bool | float64 | float64 | float64 | float64 | float64 | str | str | str | float64 | array<int32> | int32 | int32 | call | int32 | int32 | call | str | array<int32> | int32 | array<int32> |
chr1:10109 | ["AACCCT","A"] | NA | 7.84e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [21] | [4.21e-03] | 4988 | -1.67e+00 | -2.48e-01 | 181588 | false | NA | 2.15e+09 | 3.88e+01 | NA | -3.03e-01 | [7] | [1.40e-03] | 3.24e+01 | 0 | 7.50e-02 | true | false | 3.31e+01 | NA | -3.65e-01 | 3.20e+00 | -1.17e+01 | NA | "DP" | "HG00096" | NA | [77,0] | 77 | 0 | 0/0 | NA | NA | NA | NA | [0,0,982] | NA | NA |
chr1:10109 | ["AACCCT","A"] | NA | 7.84e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [21] | [4.21e-03] | 4988 | -1.67e+00 | -2.48e-01 | 181588 | false | NA | 2.15e+09 | 3.88e+01 | NA | -3.03e-01 | [7] | [1.40e-03] | 3.24e+01 | 0 | 7.50e-02 | true | false | 3.31e+01 | NA | -3.65e-01 | 3.20e+00 | -1.17e+01 | NA | "DP" | "HG00097" | NA | [71,0] | 71 | 0 | 0/0 | NA | NA | NA | NA | [0,0,864] | NA | NA |
chr1:10109 | ["AACCCT","A"] | NA | 7.84e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [21] | [4.21e-03] | 4988 | -1.67e+00 | -2.48e-01 | 181588 | false | NA | 2.15e+09 | 3.88e+01 | NA | -3.03e-01 | [7] | [1.40e-03] | 3.24e+01 | 0 | 7.50e-02 | true | false | 3.31e+01 | NA | -3.65e-01 | 3.20e+00 | -1.17e+01 | NA | "DP" | "HG00099" | NA | [100,0] | 100 | 0 | 0/0 | NA | NA | NA | NA | [0,0,1454] | NA | NA |
chr1:10109 | ["AACCCT","A"] | NA | 7.84e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [21] | [4.21e-03] | 4988 | -1.67e+00 | -2.48e-01 | 181588 | false | NA | 2.15e+09 | 3.88e+01 | NA | -3.03e-01 | [7] | [1.40e-03] | 3.24e+01 | 0 | 7.50e-02 | true | false | 3.31e+01 | NA | -3.65e-01 | 3.20e+00 | -1.17e+01 | NA | "DP" | "HG00100" | NA | [4,1] | 5 | 8 | 0/0 | NA | NA | NA | NA | [0,8,125] | NA | NA |
chr1:10109 | ["AACCCT","A"] | NA | 7.84e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [21] | [4.21e-03] | 4988 | -1.67e+00 | -2.48e-01 | 181588 | false | NA | 2.15e+09 | 3.88e+01 | NA | -3.03e-01 | [7] | [1.40e-03] | 3.24e+01 | 0 | 7.50e-02 | true | false | 3.31e+01 | NA | -3.65e-01 | 3.20e+00 | -1.17e+01 | NA | "DP" | "HG00101" | NA | [2,3] | 5 | 55 | 0/1 | NA | NA | NA | NA | [61,0,55] | NA | NA |
chr1:10109 | ["AACCCT","A"] | NA | 7.84e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [21] | [4.21e-03] | 4988 | -1.67e+00 | -2.48e-01 | 181588 | false | NA | 2.15e+09 | 3.88e+01 | NA | -3.03e-01 | [7] | [1.40e-03] | 3.24e+01 | 0 | 7.50e-02 | true | false | 3.31e+01 | NA | -3.65e-01 | 3.20e+00 | -1.17e+01 | NA | "DP" | "HG00102" | NA | [87,0] | 87 | 0 | 0/0 | NA | NA | NA | NA | [0,0,1131] | NA | NA |
chr1:10109 | ["AACCCT","A"] | NA | 7.84e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [21] | [4.21e-03] | 4988 | -1.67e+00 | -2.48e-01 | 181588 | false | NA | 2.15e+09 | 3.88e+01 | NA | -3.03e-01 | [7] | [1.40e-03] | 3.24e+01 | 0 | 7.50e-02 | true | false | 3.31e+01 | NA | -3.65e-01 | 3.20e+00 | -1.17e+01 | NA | "DP" | "HG00103" | NA | [60,0] | 60 | 0 | 0/0 | NA | NA | NA | NA | [0,0,853] | NA | NA |
chr1:10109 | ["AACCCT","A"] | NA | 7.84e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [21] | [4.21e-03] | 4988 | -1.67e+00 | -2.48e-01 | 181588 | false | NA | 2.15e+09 | 3.88e+01 | NA | -3.03e-01 | [7] | [1.40e-03] | 3.24e+01 | 0 | 7.50e-02 | true | false | 3.31e+01 | NA | -3.65e-01 | 3.20e+00 | -1.17e+01 | NA | "DP" | "HG00105" | NA | [42,0] | 42 | 0 | 0/0 | NA | NA | NA | NA | [0,0,296] | NA | NA |
chr1:10109 | ["AACCCT","A"] | NA | 7.84e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [21] | [4.21e-03] | 4988 | -1.67e+00 | -2.48e-01 | 181588 | false | NA | 2.15e+09 | 3.88e+01 | NA | -3.03e-01 | [7] | [1.40e-03] | 3.24e+01 | 0 | 7.50e-02 | true | false | 3.31e+01 | NA | -3.65e-01 | 3.20e+00 | -1.17e+01 | NA | "DP" | "HG00106" | NA | [64,0] | 64 | 0 | 0/0 | NA | NA | NA | NA | [0,0,885] | NA | NA |
chr1:10109 | ["AACCCT","A"] | NA | 7.84e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [21] | [4.21e-03] | 4988 | -1.67e+00 | -2.48e-01 | 181588 | false | NA | 2.15e+09 | 3.88e+01 | NA | -3.03e-01 | [7] | [1.40e-03] | 3.24e+01 | 0 | 7.50e-02 | true | false | 3.31e+01 | NA | -3.65e-01 | 3.20e+00 | -1.17e+01 | NA | "DP" | "HG00107" | NA | [71,0] | 71 | 0 | 0/0 | NA | NA | NA | NA | [0,0,797] | NA | NA |
showing top 10 rows
In [7]:
mt
.
cols
()
.
show
()
2020-05-29 09:33:07 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
To preserve matrix table column order, first unkey columns with 'key_cols_by()'
s |
str |
"HG00096" |
"HG00097" |
"HG00099" |
"HG00100" |
"HG00101" |
"HG00102" |
"HG00103" |
"HG00105" |
"HG00106" |
"HG00107" |
showing top 10 rows
In [8]:
mt
.
rows
()
.
show
()
locus | alleles | rsid | qual | filters | info.AC | info.AF | info.AN | info.BaseQRankSum | info.ClippingRankSum | info.DP | info.DS | info.END | info.ExcessHet | info.FS | info.HaplotypeScore | info.InbreedingCoeff | info.MLEAC | info.MLEAF | info.MQ | info.MQ0 | info.MQRankSum | info.NEGATIVE_TRAIN_SITE | info.POSITIVE_TRAIN_SITE | info.QD | info.RAW_MQ | info.ReadPosRankSum | info.SOR | info.VQSLOD | info.VariantType | info.culprit |
locus<GRCh38> | array<str> | str | float64 | set<str> | array<int32> | array<float64> | int32 | float64 | float64 | int32 | bool | int32 | float64 | float64 | float64 | float64 | array<int32> | array<float64> | float64 | int32 | float64 | bool | bool | float64 | float64 | float64 | float64 | float64 | str | str |
chr1:10109 | ["AACCCT","A"] | NA | 7.84e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [21] | [4.21e-03] | 4988 | -1.67e+00 | -2.48e-01 | 181588 | false | NA | 2.15e+09 | 3.88e+01 | NA | -3.03e-01 | [7] | [1.40e-03] | 3.24e+01 | 0 | 7.50e-02 | true | false | 3.31e+01 | NA | -3.65e-01 | 3.20e+00 | -1.17e+01 | NA | "DP" |
chr1:10126 | ["T","C"] | NA | 1.43e+04 | {"VQSRTrancheSNP99.80to100.00"} | [3] | [6.68e-04] | 4488 | 2.55e-01 | -2.90e-01 | 162897 | false | NA | 2.15e+09 | 1.42e+01 | NA | -3.23e-01 | [2] | [4.46e-04] | 7.33e+00 | 0 | 3.48e-01 | false | false | 3.42e+01 | NA | -2.36e-01 | 1.03e+00 | -2.85e+01 | NA | "DP" |
chr1:10140 | ["ACCCTAAC","A"] | NA | 7.05e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [8] | [2.88e-03] | 2776 | 7.36e-01 | 7.36e-01 | 146862 | false | NA | 2.15e+09 | 5.06e+00 | NA | -1.91e-01 | [7] | [2.52e-03] | 2.14e+00 | 0 | 1.03e+00 | true | false | 3.06e+01 | NA | 5.72e-01 | 1.26e+00 | -3.25e+00 | NA | "DP" |
chr1:10143 | ["CTAACCCCT","C","*"] | NA | 9.19e+03 | {"VQSRTrancheINDEL99.00to100.00"} | [2,11] | [8.59e-04,4.73e-03] | 2328 | 0.00e+00 | -7.27e-01 | 145354 | false | NA | 1.60e-03 | 8.92e+00 | NA | 6.66e-01 | [2,8] | [8.59e-04,3.44e-03] | 1.56e+00 | 0 | 7.20e-01 | true | false | 2.91e+01 | NA | 7.27e-01 | 1.43e+00 | -3.30e+00 | NA | "DP" |
chr1:10145 | ["AACCCCT","*","ACCCT"] | NA | 2.52e+04 | {} | [16,6] | [1.20e-02,4.55e-03] | 1320 | 7.27e-01 | 0.00e+00 | 127064 | false | NA | 4.73e+00 | 1.97e+00 | NA | 1.11e-01 | [15,7] | [1.10e-02,5.30e-03] | 1.15e+00 | 0 | -7.36e-01 | true | false | 3.29e+01 | NA | 7.36e-01 | 8.95e-01 | -1.81e+00 | NA | "DP" |
chr1:10146 | ["ACC","AC","*"] | NA | 6.70e+04 | {} | [660,19] | [4.03e-01,1.20e-02] | 1638 | -5.37e-01 | -9.90e-02 | 104333 | false | NA | 4.80e+01 | 1.10e+01 | NA | 1.71e-01 | [645,17] | [3.94e-01,1.00e-02] | 9.38e+00 | 0 | 7.27e-01 | true | false | 2.17e+01 | NA | 7.27e-01 | 1.42e+00 | -1.16e+00 | NA | "FS" |
chr1:10147 | ["C","*","CCCTAA"] | NA | 6.73e+04 | {} | [679,1] | [4.18e-01,6.15e-04] | 1626 | -4.96e-01 | 2.48e-01 | 104178 | false | NA | 8.27e+01 | 1.06e+01 | NA | 1.60e-01 | [671,1] | [4.13e-01,6.15e-04] | 1.00e+00 | 0 | 7.20e-01 | true | false | 2.17e+01 | NA | 0.00e+00 | 1.43e+00 | -1.20e+00 | NA | "FS" |
chr1:10150 | ["CT","TT","*","C"] | NA | 2.26e+04 | {} | [6,9,5] | [1.60e-02,2.40e-02,1.30e-02] | 372 | -2.55e-01 | 5.72e-01 | 116528 | false | NA | 2.78e+00 | 5.70e+00 | NA | 3.29e-01 | [4,9,2] | [1.10e-02,2.40e-02,5.38e-03] | 1.19e+00 | 0 | -1.03e+00 | true | false | 3.08e+01 | NA | -5.53e-01 | 9.69e-01 | -1.61e+00 | NA | "FS" |
chr1:10152 | ["A","G"] | NA | 9.27e+03 | {"VQSRTrancheSNP99.80to100.00"} | [1] | [2.33e-03] | 430 | -1.03e+00 | 0.00e+00 | 117542 | false | NA | 1.57e+01 | 0.00e+00 | NA | -1.99e-01 | [1] | [2.33e-03] | 3.40e-01 | 0 | 1.03e+00 | false | false | 2.97e+01 | NA | -1.03e+00 | 5.27e-01 | -1.65e+01 | NA | "MQ" |
chr1:10153 | ["A","C"] | NA | 1.73e+04 | {"VQSRTrancheSNP99.80to100.00"} | [1] | [2.66e-03] | 376 | -7.36e-01 | -7.36e-01 | 116928 | false | NA | 5.53e+01 | 2.05e+00 | NA | -2.98e-01 | [1] | [2.66e-03] | 5.90e-01 | 0 | -7.36e-01 | false | false | 3.40e+01 | NA | 7.36e-01 | 7.99e-01 | -1.75e+01 | NA | "MQ" |
showing top 10 rows
In [9]:
mt
.
globals_table
()
.
show
()
describe¶
matrixtableの構造はdescribeを使うことで知ることができます。In [10]:
mt
.
describe
()
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
----------------------------------------
Row fields:
'locus': locus<GRCh38>
'alleles': array<str>
'rsid': str
'qual': float64
'filters': set<str>
'info': struct {
AC: array<int32>,
AF: array<float64>,
AN: int32,
BaseQRankSum: float64,
ClippingRankSum: float64,
DP: int32,
DS: bool,
END: int32,
ExcessHet: float64,
FS: float64,
HaplotypeScore: float64,
InbreedingCoeff: float64,
MLEAC: array<int32>,
MLEAF: array<float64>,
MQ: float64,
MQ0: int32,
MQRankSum: float64,
NEGATIVE_TRAIN_SITE: bool,
POSITIVE_TRAIN_SITE: bool,
QD: float64,
RAW_MQ: float64,
ReadPosRankSum: float64,
SOR: float64,
VQSLOD: float64,
VariantType: str,
culprit: str
}
----------------------------------------
Entry fields:
'AB': float64
'AD': array<int32>
'DP': int32
'GQ': int32
'GT': call
'MIN_DP': int32
'MQ0': int32
'PGT': call
'PID': str
'PL': array<int32>
'RGQ': int32
'SB': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
Filter¶
先ほどのshowで見てきたように、このデータには次のようなエントリが含まれていました。["AACCCT","A"]
["CT","TT","*","C"]
hailの便利な機能: テスト目的でのデータの絞り込み¶
前述したようにこのデータは圧縮した状態で1.3TBある巨大なものでした。 様々な関数を試していく上で全てのデータで計算をさせるのは非効率的です。 テストのために小さなデータを用意するのも手ではありますが、 hailにはこれに対応できるもっと便利な機能があります。 次のように実行することで、chr22のみのデータをtest_mtとして利用できるようになります。In [11]:
test_mt
=
mt
.
filter_rows
(
mt
.
locus
.
contig
==
'chr22'
)
In [12]:
test_mt
.
show
(
20
)
locus | alleles |
locus<GRCh38> | array<str> |
chr22:10510061 | ["A","T"] |
chr22:10510077 | ["C","A"] |
chr22:10510105 | ["T","A"] |
chr22:10510119 | ["A","G"] |
chr22:10510169 | ["G","A"] |
chr22:10510183 | ["A","G"] |
chr22:10510212 | ["A","T"] |
chr22:10510292 | ["T","C"] |
chr22:10510352 | ["AT","A"] |
chr22:10510353 | ["T","*","TA"] |
chr22:10510355 | ["AT","A"] |
chr22:10510356 | ["T","A","*"] |
chr22:10510424 | ["A","G"] |
chr22:10510445 | ["A","G"] |
chr22:10510450 | ["G","A"] |
chr22:10510530 | ["C","T"] |
chr22:10510532 | ["C","T"] |
chr22:10510545 | ["C","T","A"] |
chr22:10510569 | ["G","A"] |
chr22:10510597 | ["C","T"] |
showing top 20 rows
showing the first 0 of 2504 columns
複数候補があるものを分割する¶
上記の例ではallelsに複数の候補を持つもの(chr22:10510545)がありました。 これらを分割します。In [13]:
%%time
test_mt
=
hl
.
split_multi_hts
(
test_mt
)
CPU times: user 91.4 ms, sys: 0 ns, total: 91.4 ms
Wall time: 96.9 ms
内容を確認します。
In [14]:
%
time
test_mt.show(20)
locus | alleles |
locus<GRCh38> | array<str> |
chr22:10510061 | ["A","T"] |
chr22:10510077 | ["C","A"] |
chr22:10510105 | ["T","A"] |
chr22:10510119 | ["A","G"] |
chr22:10510169 | ["G","A"] |
chr22:10510183 | ["A","G"] |
chr22:10510212 | ["A","T"] |
chr22:10510292 | ["T","C"] |
chr22:10510352 | ["AT","A"] |
chr22:10510353 | ["T","TA"] |
chr22:10510355 | ["AT","A"] |
chr22:10510356 | ["T","A"] |
chr22:10510424 | ["A","G"] |
chr22:10510445 | ["A","G"] |
chr22:10510450 | ["G","A"] |
chr22:10510530 | ["C","T"] |
chr22:10510532 | ["C","T"] |
chr22:10510545 | ["C","A"] |
chr22:10510545 | ["C","T"] |
chr22:10510569 | ["G","A"] |
showing top 20 rows
showing the first 0 of 2504 columns
CPU times: user 118 ms, sys: 0 ns, total: 118 ms
Wall time: 3.72 s
1塩基置換に絞り込み¶
上記結果を見ると多塩基の置換も含まれています。これを1塩基置換のみに絞り込みます。In [15]:
%
time
test_mt = test_mt.filter_rows( \
((
test_mt
.
alleles
[
0
]
==
'A'
)
|
(
test_mt
.
alleles
[
0
]
==
'G'
)
|
\
(
test_mt
.
alleles
[
0
]
==
'C'
)
|
(
test_mt
.
alleles
[
0
]
==
'T'
))
&
\
((
test_mt
.
alleles
[
1
]
==
'A'
)
|
(
test_mt
.
alleles
[
1
]
==
'G'
)
|
\
(
test_mt
.
alleles
[
1
]
==
'C'
)
|
(
test_mt
.
alleles
[
1
]
==
'T'
)))
CPU times: user 10.2 ms, sys: 0 ns, total: 10.2 ms
Wall time: 10.2 ms
結果を確認します
1塩基置換のみが表示されます。
In [16]:
%
time
test_mt.show(20)
locus | alleles |
locus<GRCh38> | array<str> |
chr22:10510061 | ["A","T"] |
chr22:10510077 | ["C","A"] |
chr22:10510105 | ["T","A"] |
chr22:10510119 | ["A","G"] |
chr22:10510169 | ["G","A"] |
chr22:10510183 | ["A","G"] |
chr22:10510212 | ["A","T"] |
chr22:10510292 | ["T","C"] |
chr22:10510356 | ["T","A"] |
chr22:10510424 | ["A","G"] |
chr22:10510445 | ["A","G"] |
chr22:10510450 | ["G","A"] |
chr22:10510530 | ["C","T"] |
chr22:10510532 | ["C","T"] |
chr22:10510545 | ["C","A"] |
chr22:10510545 | ["C","T"] |
chr22:10510569 | ["G","A"] |
chr22:10510597 | ["C","T"] |
chr22:10510628 | ["G","A"] |
chr22:10510683 | ["T","A"] |
showing top 20 rows
showing the first 0 of 2504 columns
CPU times: user 127 ms, sys: 0 ns, total: 127 ms
Wall time: 2.78 s
これで一塩基置換へ絞り込むことができました。
今回は次のような操作を行いました。
- hailを起動し、1.3TBあるvcfファイルをmatrixtableに変換し、読み込み
- 読み込んだデータの閲覧
- matrixtableの構造を閲覧
- 複数候補のあるデータを1行1エントリに分割
- 1塩基置換へ絞り込み
In [ ]: