Train graph neural nets for millions of proteins on Amazon SageMaker and Amazon DocumentDB (with MongoDB compatibility)

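There are over 180,000 unique proteins with determined 3D structures, and tens of thousands of new structures are resolved every year. This is only a small fraction of the 200 million known proteins with distinctive sequences. Recent deep learning algorithms such as AlphaFold can accurately predict the 3D structure of a protein from its sequence, which helps scale protein 3D structure data to the millions. Graph neural networks (GNNs) have emerged as an effective deep learning approach for extracting information from protein structures, which can be represented as graphs of amino acid residues. Individual protein graphs usually contain a few hundred nodes, which is manageable in size. Tens of thousands of protein graphs can easily be stored in serialized data structures such as TFRecord for training GNNs. However, training a GNN on millions of protein structures is challenging. Data serialization doesn't scale to millions of protein structures because it requires loading the entire terabyte-scale dataset into memory.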

In this post, we introduce a scalable deep learning solution that lets you train GNNs with Amazon SageMaker on millions of proteins stored in Amazon DocumentDB (with MongoDB compatibility).

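For illustrative purposes, we use publicly available, experimentally determined protein structures from the Protein Data Bank and computationally predicted protein structures from the AlphaFold Protein Structure Database. The machine learning (ML) problem is to develop a discriminator GNN model that distinguishes experimental structures from predicted ones, based on protein graphs constructed from their 3D structures.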

Overview of solution

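We first parse the protein structures into JSON records that contain multiple types of data structures, such as n-dimensional arrays and nested objects, to store each protein's atomic coordinates, properties, and identifiers. Storing the JSON record for one protein structure takes 45 KB on average; we project that storing 100 million proteins would take around 4.2 TB. Amazon DocumentDB storage scales automatically with the data in your cluster volume, in 10 GB increments up to 64 TB. The support for JSON data structures together with this scalability makes Amazon DocumentDB a natural choice.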
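The 4.2 TB projection can be reproduced with a few lines of arithmetic. The sketch below assumes binary units (1 KB = 1024 bytes, 1 TB = 1024^4 bytes), which is an assumption on our part; the function name is illustrative only.

```python
# Back-of-the-envelope storage estimate for protein JSON records.
# Assumption: binary units (1 KB = 1024 bytes, 1 TB = 1024**4 bytes).
AVG_RECORD_KB = 45            # average record size stated in this post
NUM_PROTEINS = 100_000_000    # 100 million proteins
MAX_VOLUME_TB = 64            # cluster volume limit stated in this post


def projected_storage_tb(avg_record_kb: float, num_records: int) -> float:
    """Return the projected storage in TB for num_records records."""
    total_bytes = avg_record_kb * 1024 * num_records
    return total_bytes / 1024**4


estimate_tb = projected_storage_tb(AVG_RECORD_KB, NUM_PROTEINS)
print(f"Projected storage: {estimate_tb:.1f} TB")
print(f"Fits in one cluster volume: {estimate_tb < MAX_VOLUME_TB}")
```

Under these assumptions the estimate leaves ample headroom below the 64 TB volume limit.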

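Next, we build a GNN model to predict protein properties using graphs of amino acid residues constructed from the structures. The GNN model is trained using SageMaker and configured to efficiently retrieve batches of protein structures from the database.

Finally, we analyze the trained GNN model to gain some insights into its predictions.

We walk through the following steps for this tutorial:

Create resources using an AWS CloudFormation template.
Prepare protein structures and properties and ingest the data into Amazon DocumentDB.
Train a GNN on the protein structures using SageMaker.
Load and evaluate the trained GNN model.

The code and notebooks used in this post are available in the GitHub repo.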

Prerequisites

For this walkthrough, you should have the following prerequisites:

Running this tutorial for an hour should cost no more than $2.00.

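Create resources

We provide a CloudFormation template to create the AWS resources required for this post, with an architecture similar to the one in the post Analyzing data stored in Amazon DocumentDB (with MongoDB compatibility) using Amazon SageMaker. For instructions on creating a CloudFormation stack, see the video Simplify your Infrastructure Management using AWS CloudFormation.

The CloudFormation stack provisions the following:

A VPC with three private subnets for Amazon DocumentDB and two public subnets intended for the SageMaker notebook instance and the ML training containers, respectively.
An Amazon DocumentDB cluster with three nodes, one in each private subnet.
A Secrets Manager secret to store the login credentials for Amazon DocumentDB. This lets us avoid storing plaintext credentials in the SageMaker instance.
A SageMaker notebook instance to prepare data, orchestrate training jobs, and run interactive analyses.

When creating the CloudFormation stack, you need to specify the following:

Name for your CloudFormation stack
Amazon DocumentDB user name and password (to be stored in Secrets Manager)
Amazon DocumentDB instance type (default db.r5.large)
SageMaker instance type (default ml.t3.xlarge)

The CloudFormation stack takes about 15 minutes to create. The following diagram illustrates the resource architecture.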

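Prepare protein structures and properties and ingest the data into Amazon DocumentDB

All the subsequent code in this section is in the Jupyter notebook Prepare_data.ipynb in the SageMaker instance created by your CloudFormation stack.

This notebook handles the steps required to prepare protein structure data and ingest it into Amazon DocumentDB.

We first download predicted protein structures in PDB format from AlphaFold DB, along with the matching experimental structures from the Protein Data Bank.

For demonstration purposes, we only use proteins from the thermophilic archaeon Methanocaldococcus jannaschii, which has the smallest proteome available for us to work with, at 1,773 proteins. You are more than welcome to try proteins from other species.

We connect to the Amazon DocumentDB cluster by retrieving the credentials stored in Secrets Manager: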

import json

import boto3
from pymongo import MongoClient


def get_secret(stack_name):
    """Retrieve the DocumentDB credentials stored in Secrets Manager."""
    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name="secretsmanager", region_name=session.region_name
    )
    secret_name = f"{stack_name}-DocDBSecret"
    get_secret_value_response = client.get_secret_value(SecretId=secret_name)
    secret = get_secret_value_response["SecretString"]
    return json.loads(secret)


secrets = get_secret("gnn-proteins")

# connect to DocDB
uri = (
    "mongodb://{}:{}@{}:{}/?tls=true&tlsCAFile=rds-combined-ca-bundle.pem"
    "&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"
).format(secrets["username"], secrets["password"], secrets["host"], secrets["port"])
client = MongoClient(uri)

db = client["proteins"]  # create a database
collection = db["proteins"]  # create a collection
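One detail worth keeping in mind when formatting credentials into a MongoDB-style URI: characters such as "@", ":" or "/" in the user name or password need to be percent-escaped. The following is a minimal sketch of a URI builder that does this with the standard library. The helper name build_docdb_uri and the sample credentials are hypothetical and not part of the post's notebook.

```python
from urllib.parse import quote_plus


def build_docdb_uri(username, password, host, port,
                    ca_file="rds-combined-ca-bundle.pem"):
    """Build a DocumentDB connection URI with percent-escaped credentials."""
    return (
        "mongodb://{}:{}@{}:{}/"
        "?tls=true&tlsCAFile={}&replicaSet=rs0"
        "&readPreference=secondaryPreferred&retryWrites=false"
    ).format(quote_plus(username), quote_plus(password), host, port, ca_file)


# Hypothetical credentials and host, for illustration only
example_uri = build_docdb_uri("admin", "p@ss:word/1", "docdb.example.com", 27017)
print(example_uri)
```

The resulting string can be passed to MongoClient in the same way as the URI built in the notebook.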

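After we set up the connection to Amazon DocumentDB, we parse the PDB files into JSON records to ingest into the database.

We provide the utility functions required for parsing PDB files in pdb_parse.py. The parse_pdb_file_to_json_record function does the heavy lifting of extracting atomic coordinates from one or more peptide chains in a PDB file, and returns one JSON document or a list of them, which can be ingested directly into the Amazon DocumentDB collection. See the following code: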

recs = parse_pdb_file_to_json_record(pdb_parser, pdb_file, pdb_id)
collection.insert_many(recs)
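To give a feel for what such a parser does, here is a deliberately simplified sketch that reads the fixed-width ATOM lines of a PDB file and keeps only the C-alpha atoms of one chain. It is not the pdb_parse.py implementation: the function names, the truncated residue map, and the flat one-coordinate-per-residue layout are assumptions made for illustration (the records used in this post appear to store several atoms per residue).

```python
# Simplified sketch: NOT the pdb_parse.py implementation from the post.
THREE_TO_ONE = {"MET": "M", "ALA": "A", "GLY": "G"}  # truncated map for the demo


def format_atom_line(serial, atom, res_name, chain, res_seq, x, y, z):
    """Build a fixed-width PDB ATOM line (hypothetical demo helper)."""
    return "ATOM  {:5d} {:^4s} {:3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, atom, res_name, chain, res_seq, x, y, z
    )


def parse_ca_record(pdb_lines, pdb_id, chain_id):
    """Extract the sequence and C-alpha coordinates of one chain as a dict."""
    seq, coords = [], []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # columns 13-16: atom name, column 22: chain identifier
        if line[21] != chain_id or line[12:16].strip() != "CA":
            continue
        seq.append(THREE_TO_ONE.get(line[17:20], "X"))  # columns 18-20: residue
        # columns 31-54: x, y, z coordinates (8 characters each)
        coords.append(
            [float(line[30:38]), float(line[38:46]), float(line[46:54])]
        )
    return {"id": f"{pdb_id}-{chain_id}", "seq": "".join(seq), "coords": coords}


sample_lines = [
    format_atom_line(1, "N", "MET", "A", 1, 0.0, 0.0, 0.0),
    format_atom_line(2, "CA", "MET", "A", 1, 1.0, 2.0, 3.0),
    format_atom_line(3, "CA", "ALA", "A", 2, 4.5, 5.5, 6.5),
    format_atom_line(4, "CA", "GLY", "B", 1, 9.0, 9.0, 9.0),
]
record = parse_ca_record(sample_lines, "1JT8", "A")
print(record)
```

A dictionary of this shape, made of plain strings and nested lists, is the kind of document that a collection's insert methods accept.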

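After we ingest the parsed protein data into Amazon DocumentDB, we can update the contents of the protein documents. For instance, adding a field that indicates whether a protein structure belongs to the training, validation, or test set simplifies the model training logistics.

We first retrieve all the documents that have the field is_AF using an aggregation pipeline, so that we can stratify the documents: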

import pandas as pd
from sklearn.model_selection import train_test_split

match = {"is_AF": {"$exists": True}}
project = {"y": "$is_AF"}
pipeline = [
    {"$match": match},
    {"$project": project},
]
# aggregation pipeline
cur = collection.aggregate(pipeline)
# retrieve documents from the DB cursor
docs = [doc for doc in cur]
# convert to a data frame:
df = pd.DataFrame(docs)
# stratified split: full -> train/test
df_train, df_test = train_test_split(
    df, test_size=0.2, stratify=df["y"], random_state=42
)
# stratified split: train -> train/valid
df_train, df_valid = train_test_split(
    df_train, test_size=0.2, stratify=df_train["y"], random_state=42
)
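Because the second split is applied to the 80% that remains after the first one, the final proportions are not 60/20/20. The following short sketch works out the resulting fractions; the helper name is illustrative.

```python
def split_fractions(test_size=0.2, valid_size=0.2):
    """Fractions of the full dataset after two successive splits."""
    train_valid = 1.0 - test_size       # left after holding out the test set
    valid = train_valid * valid_size    # validation share of the remainder
    train = train_valid - valid
    return {"train": train, "valid": valid, "test": test_size}


fractions = split_fractions()
for name, frac in fractions.items():
    print(f"{name}: {frac:.0%}")
```

With both sizes set to 0.2, roughly 64% of the documents end up in the training set, 16% in validation, and 20% in test.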

Next, we use the update_many function to store the split information back in Amazon DocumentDB:

for split, df_split in zip(
    ["train", "valid", "test"], [df_train, df_valid, df_test]
):
    result = collection.update_many(
        {"_id": {"$in": df_split["_id"].tolist()}}, {"$set": {"split": split}}
    )
    print("Number of documents modified:", result.modified_count)

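Train a GNN on the protein structures using SageMaker

All the subsequent code in this section is in the Train_and_eval.ipynb notebook in the SageMaker instance created by your CloudFormation stack.

This notebook trains a GNN model on the protein structure dataset stored in Amazon DocumentDB.

We first need to implement a PyTorch dataset class for the protein dataset that can retrieve mini-batches of protein documents from Amazon DocumentDB. Retrieving documents in batches by the built-in primary ID (_id) is more efficient.

We use an iterable-style dataset by extending IterableDataset, and pre-fetch the _id and labels of the documents at initialization: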

from pymongo import MongoClient
from torch.utils import data


class ProteinDataset(data.IterableDataset):
    """
    An iterable-style dataset for proteins in DocumentDB
    Args:
        pipeline: an aggregation pipeline to retrieve data from DocumentDB
        db_uri: URI of the DocumentDB
        db_name: name of the database
        collection_name: name of the collection
        k: k used for kNN when creating a graph from atomic coordinates
    """

    def __init__(self, pipeline, db_uri="", db_name="", collection_name="", k=3):
        self.db_uri = db_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.k = k

        client = MongoClient(self.db_uri, connect=False)
        collection = client[self.db_name][self.collection_name]
        # pre-fetch the metadata as docs from DocumentDB
        self.docs = [doc for doc in collection.aggregate(pipeline)]
        # mapping document '_id' to label
        self.labels = {doc["_id"]: doc["y"] for doc in self.docs}

ProteinDataset performs the database read operation in its __iter__ method. If there are multiple workers, it tries to split the workload evenly among them:

    # method of ProteinDataset; requires `import math` and `import torch`
    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:
            # single-process data loading, return the full iterator
            protein_ids = [doc["_id"] for doc in self.docs]
        else:
            # in a worker process: split the workload
            start = 0
            end = len(self.docs)
            per_worker = int(
                math.ceil((end - start) / float(worker_info.num_workers))
            )
            worker_id = worker_info.id
            iter_start = start + worker_id * per_worker
            iter_end = min(iter_start + per_worker, end)
            protein_ids = [doc["_id"] for doc in self.docs[iter_start:iter_end]]

        # retrieve a list of proteins by _id from DocDB
        with MongoClient(self.db_uri) as client:
            collection = client[self.db_name][self.collection_name]
            cur = collection.find(
                {"_id": {"$in": protein_ids}},
                projection={"coords": True, "seq": True},
            )
            # yield inside the `with` block so that the client stays open
            # while the lazy cursor is being consumed
            for protein in cur:
                yield (
                    convert_to_graph(protein, k=self.k),
                    self.labels[protein["_id"]],
                )
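The workload split in __iter__ can be examined in isolation. The sketch below extracts the same ceiling-division arithmetic into a standalone function (the name worker_slice is ours) and shows which index range each worker would receive.

```python
import math


def worker_slice(num_docs, num_workers, worker_id):
    """Return the [start, end) index range handled by one worker."""
    per_worker = int(math.ceil(num_docs / float(num_workers)))
    start = worker_id * per_worker
    end = min(start + per_worker, num_docs)
    return start, end


# 10 documents shared by 3 workers
slices = [worker_slice(10, 3, w) for w in range(3)]
print(slices)  # [(0, 4), (4, 8), (8, 10)]
```

The last worker may receive fewer documents than the others, and with more workers than documents some ranges are empty, which simply yields nothing for that worker.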

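The preceding __iter__ method also converts the atomic coordinates of the proteins into DGLGraph objects, after they are loaded from Amazon DocumentDB, by calling the convert_to_graph function. This function constructs a k-nearest neighbor (kNN) graph of the amino acid residues from the 3D coordinates of the C-alpha atoms, and adds one-hot encoded node features that represent residue identities: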

import dgl
import torch
import torch.nn.functional as F

# d1_to_index (one-letter residue code -> integer index) is defined elsewhere


def convert_to_graph(protein, k=3):
    """
    Convert a protein (dict) to a dgl graph using kNN.
    """
    coords = torch.tensor(protein["coords"])
    X_ca = coords[:, 1]
    # construct knn graph from C-alpha coordinates
    g = dgl.knn_graph(X_ca, k=k)
    seq = protein["seq"]
    node_features = torch.tensor([d1_to_index[residue] for residue in seq])
    node_features = F.one_hot(node_features, num_classes=len(d1_to_index)).to(
        dtype=torch.float
    )
    # add node features
    g.ndata["h"] = node_features
    return g
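For readers unfamiliar with kNN graphs, the idea can be sketched in plain Python without DGL. In this sketch each point receives an incoming edge from each of its k nearest points, and a point counts as its own nearest neighbor. Whether that convention matches dgl.knn_graph in your DGL version is worth checking against its documentation. The residue-to-index map below is a hypothetical, truncated stand-in for d1_to_index.

```python
import math


def knn_edges(points, k):
    """Return directed edges (src, dst): dst receives edges from its k nearest points."""
    edges = []
    for dst, p in enumerate(points):
        # rank all points (including dst itself) by Euclidean distance to p
        by_distance = sorted(
            range(len(points)), key=lambda j: math.dist(p, points[j])
        )
        for src in by_distance[:k]:
            edges.append((src, dst))
    return edges


def one_hot(index, num_classes):
    """One-hot encode a class index as a list of floats."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


# Toy C-alpha coordinates on a line, and a hypothetical residue index map
ca_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
toy_index = {"A": 0, "G": 1, "M": 2}

edges = knn_edges(ca_coords, k=2)
features = [one_hot(toy_index[r], len(toy_index)) for r in "MAGA"]
print(edges)
print(features)
```

Each node ends up with exactly k incoming edges, so the graph size grows linearly with the number of residues.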

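With ProteinDataset implemented, we can initialize instances for the training, validation, and test datasets, and wrap the training instance with BufferedShuffleDataset to enable shuffling.

We further wrap them with torch.utils.data.DataLoader so that they work with the other components of the SageMaker PyTorch Estimator training script.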
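Shuffling an iterable-style dataset cannot rely on random index access, so it is typically done with a bounded buffer. The following is a minimal pure-Python sketch of that general technique, written for illustration; it is not the BufferedShuffleDataset source code, and the function name is ours.

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # fill the buffer first
            buffer.append(item)
            continue
        # emit a random buffered item and put the new item in its place
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
print(shuffled)
```

Only buffer_size items are held in memory at any time. A larger buffer gives a shuffle closer to uniform, while a buffer of size 1 leaves the order unchanged.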
Next, we implement a simple two-layer graph convolution network (GCN) with a global attention pooling layer for ease of interpretation:

import dgl
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GlobalAttentionPooling, GraphConv


class GCN(nn.Module):
    """A two layer Graph Conv net with Global Attention Pooling over the
    nodes.
    Args:
        in_feats: int, dim of input node features
        h_feats: int, dim of hidden layers
        num_classes: int, number of output units
    """

    def __init__(self, in_feats, h_feats, num_classes):
        super(GCN, self).__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, h_feats)
        # the gate layer that maps node feature to outputs
        self.gate_nn = nn.Linear(h_feats, num_classes)
        self.gap = GlobalAttentionPooling(self.gate_nn)
        # the output layer making predictions
        self.output = nn.Linear(h_feats, num_classes)

    def _conv_forward(self, g):
        """forward pass through the GraphConv layers"""
        in_feat = g.ndata["h"]
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        h = F.relu(h)
        return h

    def forward(self, g):
        h = self._conv_forward(g)
        h = self.gap(g, h)
        return self.output(h)

    def attention_scores(self, g):
        """Calculate attention scores"""
        h = self._conv_forward(g)
        with g.local_scope():
            gate = self.gap.gate_nn(h)
            g.ndata["gate"] = gate
            gate = dgl.softmax_nodes(g, "gate")
            g.ndata.pop("gate")
            return gate
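The pooling step reduces a variable number of node embeddings to one graph-level vector. As a rough sketch of the idea, assuming a single scalar gate score per node, the scores are normalized with a softmax over the nodes of a graph and used as weights in a sum of the node features. The plain-Python functions below are illustrative and do not reproduce DGL's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention_pool(node_feats, gate_scores):
    """Weighted sum of node features; weights are softmax-normalized gate scores."""
    weights = softmax(gate_scores)
    dim = len(node_feats[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, node_feats))
        for d in range(dim)
    ]
    return pooled, weights


# Three nodes with 2-dimensional features and equal gate scores
toy_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = global_attention_pool(toy_feats, [0.0, 0.0, 0.0])
print(pooled, weights)
```

The normalized weights are what attention_scores returns per node, which is why they can later be read as a measure of how much each residue contributes to the prediction.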

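We can then train this GCN on the ProteinDataset instances for a binary classification task: predicting whether or not a protein structure was predicted by AlphaFold. We use binary cross entropy as the objective function and the Adam optimizer for stochastic gradient optimization. The full training script can be found in src/main.py.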
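For reference, binary cross entropy on a raw model output (a logit) can be written in a numerically stable form. The sketch below is a plain-Python rendering of that formula for a single example. It assumes the model outputs logits, which the use of sigmoid at evaluation time suggests, but the exact loss class used in src/main.py is not shown in this post.

```python
import math


def bce_with_logits(logit, label):
    """Binary cross entropy for one example, computed stably from a logit.

    Equivalent to -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# label 1.0 means "structure predicted by AlphaFold"
for logit in (-2.0, 0.0, 2.0):
    print(logit, round(bce_with_logits(logit, 1.0), 4))
```

The loss shrinks as the logit moves toward the correct side of zero, and the max/abs formulation avoids overflow for large logits.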

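Next, we set up the SageMaker PyTorch Estimator to handle the training job. To allow the managed Docker container started by SageMaker to connect to Amazon DocumentDB, we need to configure the subnet and security group for the Estimator.

We retrieve the ID of the subnet in which the Network Address Translation (NAT) gateway resides, as well as the security group ID of our Amazon DocumentDB cluster, by name: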

ec2 = boto3.client("ec2")
# find the NAT gateway's subnet ID
resp = ec2.describe_subnets(
    Filters=[{"Name": "tag:Name", "Values": ["{}-NATSubnet".format(stack_name)]}]
)
nat_subnet_id = resp["Subnets"][0]["SubnetId"]
# find security group id of the DocumentDB
resp = ec2.describe_security_groups(
    Filters=[{"Name": "tag:Name", "Values": ["{}-SG-DocumentDB".format(stack_name)]}]
)
sg_id = resp["SecurityGroups"][0]["GroupId"]

Finally, we can kick off the training of our GCN model using SageMaker:

from sagemaker.pytorch import PyTorch

# `role` and `sess` are defined earlier in the notebook
CODE_PATH = "main.py"
params = {
    "patience": 5,
    "n-epochs": 200,
    "batch-size": 64,
    "db-host": secrets["host"],
    "db-username": secrets["username"],
    "db-password": secrets["password"],
    "db-port": secrets["port"],
    "knn": 4,
}
estimator = PyTorch(
    entry_point=CODE_PATH,
    source_dir="src",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # 'ml.c4.2xlarge' for CPU
    framework_version="1.7.1",
    py_version="py3",
    hyperparameters=params,
    sagemaker_session=sess,
    subnets=[nat_subnet_id],
    security_group_ids=[sg_id],
)
# run the training job:
estimator.fit()
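In SageMaker script mode, the entries of the hyperparameters dictionary reach the entry point script as command-line arguments such as --batch-size 64. The following is a hedged sketch of how a training script could read them with argparse; the actual argument handling in src/main.py may differ, and the defaults shown here simply mirror the values above.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters passed to the training script as CLI arguments."""
    parser = argparse.ArgumentParser(description="GCN training (sketch)")
    parser.add_argument("--patience", type=int, default=5)
    parser.add_argument("--n-epochs", type=int, default=200)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--knn", type=int, default=4)
    parser.add_argument("--db-host", type=str, default="")
    parser.add_argument("--db-username", type=str, default="")
    parser.add_argument("--db-password", type=str, default="")
    parser.add_argument("--db-port", type=int, default=27017)
    return parser.parse_args(argv)


# Simulate the arguments a training container might receive (hypothetical host)
args = parse_args(["--batch-size", "32", "--db-host", "docdb.example.com"])
print(args.batch_size, args.db_host, args.knn)
```

argparse converts the dashes in option names to underscores, so --batch-size becomes args.batch_size.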

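Load and evaluate the trained GNN model

When the training job is complete, we can load the trained GCN model and perform an in-depth evaluation.

The code for the following steps is also available in the notebook Train_and_eval.ipynb.

SageMaker training jobs save model artifacts into the default S3 bucket; the URI is available from the estimator.model_data attribute. You can also navigate to the Training jobs page on the SageMaker console to find the trained model to evaluate.

For research purposes, we can load the model artifact (the learned parameters) into a PyTorch state_dict using the following function: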

import tarfile
from io import BytesIO

import boto3
import torch


def load_sagemaker_model_artifact(s3_bucket, key):
    """Load a PyTorch model artifact (model.tar.gz) produced by a SageMaker
    Training job.
    Args:
        s3_bucket: str, S3 bucket name (the bare name, as expected by boto3)
        key: object key: path to model.tar.gz from within the bucket
    Returns:
        state_dict: dict representing the PyTorch checkpoint
    """
    # load the s3 object
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=s3_bucket, Key=key)
    # read into memory
    model_artifact = BytesIO(obj["Body"].read())
    # parse out the state dict from the tar.gz file
    tar = tarfile.open(fileobj=model_artifact)
    for member in tar.getmembers():
        pth = tar.extractfile(member).read()
    state_dict = torch.load(BytesIO(pth), map_location=torch.device("cpu"))
    return state_dict


# `bucket`, `dim_nfeats`, and `n_classes` are defined earlier in the notebook
state_dict = load_sagemaker_model_artifact(
    bucket, key=estimator.model_data.split(bucket)[1].lstrip("/")
)
# initialize a GCN model
model = GCN(dim_nfeats, 16, n_classes)
# load the learned parameters
model.load_state_dict(state_dict["model_state_dict"])
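The object key above is obtained by splitting the model URI on the bucket name. An alternative that does not depend on knowing the bucket in advance is to parse the S3 URI with the standard library, as in the following sketch. The helper name and the sample URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/object' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical model artifact URI, for illustration only
example_bucket, example_key = split_s3_uri(
    "s3://my-bucket/job-name/output/model.tar.gz"
)
print(example_bucket, example_key)
```

The two returned values could then be passed as the s3_bucket and key arguments of load_sagemaker_model_artifact.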

Next, we perform a quantitative model evaluation on the full test set by calculating accuracy:

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
num_correct = 0
num_tests = 0
model.eval()
with torch.no_grad():
    for batched_graph, labels in test_loader:
        batched_graph = batched_graph.to(device)
        labels = labels.to(device)
        logits = model(batched_graph)
        preds = (logits.sigmoid() > 0.5).to(labels.dtype)
        num_correct += (preds == labels).sum().item()
        num_tests += len(labels)

print('Test accuracy: {:.6f}'.format(num_correct / num_tests))

We found that our GCN model achieved an accuracy of 74.3%, whereas a dummy baseline model that makes predictions based on class priors achieved only 56.3%.
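A prior-based baseline can be defined in more than one way, and this post does not say which variant was used. The sketch below shows two common ones on made-up toy labels: always predicting the majority class, and guessing at random according to the class frequencies. The numbers it prints are for the toy labels only and are unrelated to the results reported here.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def stratified_baseline_accuracy(labels):
    """Expected accuracy of guessing each class with its prior probability."""
    n = len(labels)
    return sum((c / n) ** 2 for c in Counter(labels).values())


toy_labels = [1] * 6 + [0] * 4  # made-up labels, for illustration only
print(majority_baseline_accuracy(toy_labels))
print(stratified_baseline_accuracy(toy_labels))
```

Either number gives a floor that a trained model should clearly exceed before its accuracy is considered meaningful.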

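We're also interested in the interpretability of our GCN model. Because we implemented a global attention pooling layer, we can compute the attention scores across nodes to explain specific predictions made by the model.

Next, we compute the attention scores and overlay them on the protein graphs for a pair of structures (AlphaFold-predicted and experimental) of the same peptide: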

pair = ["AF-Q57887", "1JT8-A"]
cur = collection.find(
    {"id": {"$in": pair}},
)

for doc in cur:
    # convert to dgl.graph object
    graph = convert_to_graph(doc, k=4)

    with torch.no_grad():
        # make prediction
        pred = model(graph).sigmoid()
        # calculate attention scores for a protein graph
        attn = model.attention_scores(graph)

    pred = pred.item()
    attn = attn.numpy()

    # convert to networkx graph for visualization
    graph = graph.to_networkx().to_undirected()
    # calculate graph layout
    pos = nx.spring_layout(graph, iterations=500)

    fig, ax = plt.subplots(figsize=(8, 8))
    nx.draw(
        graph,
        pos,
        node_color=attn.flatten(),
        cmap="Reds",
        with_labels=True,
        font_size=8,
        ax=ax
    )
    ax.set(title="{}, p(is_predicted)={:.6f}".format(doc["id"], pred))
plt.show()

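The preceding code produces the following protein graphs, overlaid with attention scores on the nodes. We find that the model's global attention pooling layer can highlight certain residues in the protein graph as important for predicting whether the protein structure was predicted by AlphaFold. This suggests that these residues may have distinctive graph topologies in predicted and experimental protein structures.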

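In summary, we showcase a scalable deep learning solution for training GNNs on protein structures stored in Amazon DocumentDB. Although this tutorial uses only thousands of proteins for training, the solution is scalable to millions of proteins. Unlike other approaches, such as serializing the entire protein dataset, our approach transfers the memory-heavy workload to the database, which makes the memory complexity of the training job O(batch_size), independent of the total number of proteins to train on.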
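The memory argument can be illustrated with a small generator that groups a stream of identifiers into mini-batches, so that only one batch is materialized at a time. This is a conceptual sketch with made-up identifiers, not the data loading code used in the training script.

```python
from itertools import islice


def iter_batches(id_stream, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    it = iter(id_stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def fake_id_stream(n):
    """Stand-in for a database cursor: produces ids lazily."""
    for i in range(n):
        yield f"protein-{i}"


largest_batch = 0
total_seen = 0
for batch in iter_batches(fake_id_stream(1_000), batch_size=64):
    largest_batch = max(largest_batch, len(batch))
    total_seen += len(batch)

print(largest_batch, total_seen)
```

However long the stream is, the consumer never holds more than batch_size items at once, which is the property the O(batch_size) claim refers to.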

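Clean up

To avoid incurring future charges, delete the CloudFormation stack you created. This removes all the resources you provisioned with the CloudFormation template, including the VPC, the Amazon DocumentDB cluster, and the SageMaker instance. For instructions, see Deleting a stack on the AWS CloudFormation console.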

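Conclusion

We described a cloud-based deep learning architecture that scales to millions of protein structures by storing them in Amazon DocumentDB and efficiently retrieving mini-batches of data from SageMaker.

To learn more about the use of GNNs in protein property prediction, check out our recent publication LM-GVP, A Generalizable Deep Learning Framework for Protein Property Prediction from Sequence and Structure.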

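About the Authors

Zichen Wang, PhD, is an Applied Scientist in the Amazon Machine Learning Solutions Lab. With several years of research experience in developing ML and statistical methods using biological and medical data, he works with customers across various verticals to solve their ML problems.

Selvan Senthivel is a Senior ML Engineer with the Amazon ML Solutions Lab at AWS, focusing on helping customers with machine learning, deep learning problems, and end-to-end ML solutions. He was a founding engineering lead of Amazon Comprehend Medical and contributed to the design and architecture of multiple AWS AI services.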