Configure Amazon EMR Studio And Amazon EKS To Run Notebooks With Amazon EMR On EKS

Republicat de Platon

Urmaritori: 0

Amazon EMR pe Amazon EKS oferă o opțiune de implementare pentru Amazon EMR that allows you to run analytics workloads on Serviciul Amazon Elastic Kubernetes (Amazon EKS). This is an attractive option because it allows you to run applications on a common pool of resources without having to provision infrastructure. In addition, you can use Amazon EMR Studio to build analytics code running on Amazon EKS clusters. EMR Studio is a web-based, integrated development environment (IDE) using fully managed Jupyter notebooks that can be attached to any EMR cluster, including EMR on EKS. It uses Conectare unică AWS (SSO) or a compatible identity provider (IdP) to log directly in to EMR Studio through a secure URL using corporate credentials.

Deploying EMR Studio to attach to EMR on EKS requires integrating several AWS services:

In addition, you need to install the following EMR on EKS components:

This post helps you build all the necessary components and stitch them together by running a single script. We also describe the architecture of this setup and how the components work together.

Privire de ansamblu asupra arhitecturii

With EMR on EKS, you can run Spark applications alongside other types of applications on the same Amazon EKS cluster, which improves resource allocation and simplifies infrastructure management. For more information about how Amazon EMR operates inside an Amazon EKS cluster, see Nou – Amazon EMR pe Amazon Elastic Kubernetes Service (EKS). EMR Studio provides a web-based IDE that makes it easy to develop, visualize, and debug applications that run in EMR. For more information, see Amazon EMR Studio (Preview): A new notebook-first IDE experience with Amazon EMR.

Spark kernels are scheduled pods in a namespace in an Amazon EKS cluster. EMR Studio uses Jupyter Enterprise Gateway (JEG) to launch Spark kernels on Amazon EKS. A managed endpoint of type JEG is provisioned as a Kubernetes deployment in the EMR virtual cluster’s associated namespace and exposed as a Kubernetes service. Each EMR virtual cluster maps to a Kubernetes namespace registered with the Amazon EKS cluster; virtual clusters don’t manage physical compute or storage, but point to the Kubernetes namespace where the workload is scheduled. Each virtual cluster can have several managed endpoints, each with their own configured kernels for different use cases and needs. JEG managed endpoints provide HTTPS endpoints, serviced by an Application Load Balancer (ALB), that are reachable only from EMR Studio and self-hosted notebooks that are created within a private subnet of the Amazon EKS VPC.

Următoarea diagramă ilustrează arhitectura soluției.

The managed endpoint is created in the virtual cluster’s Amazon EKS namespace (in this case, sparkns) and the HTTPS endpoints are serviced from private subnets. The kernel pods run with the job-execution IAM role defined in the managed endpoint. During managed endpoint creation, EMR on EKS uses the AWS Load Balancer Controller in the kube-system namespace to create an ALB with a target group that connects with the JEG managed endpoint in the virtual cluster’s Kubernetes namespace.

You can configure each managed endpoint’s kernel differently. For example, to permit a Spark kernel to use AWS Adeziv as their catalog, you can apply the following configuration JSON file in the —configuration-overrides flag when creating a managed endpoint:

aws emr-containers create-managed-endpoint --type JUPYTER_ENTERPRISE_GATEWAY --virtual-cluster-id ${virtclusterid} --name ${virtendpointname} --execution-role-arn ${role_arn} --release-label ${emr_release_label} --certificate-arn ${certarn} --region ${region} --configuration-overrides '{ "applicationConfiguration": [ { "classification": "spark-defaults", "properties": { "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory", "spark.sql.catalogImplementation": "hive" } } ] }'

The managed endpoint is a Kubernetes deployment fronted by a service inside the configured namespace (in this case, sparkns). When we trace the endpoint information, we can see how the Jupyter Enterprise Gateway deployment connects with the ALB and the target group:

# Get the endpoint ID
aws emr-containers list-managed-endpoints --region us-east-1 --virtual-cluster-id idzdhw2qltdr0dxkgx2oh4bp1
{ "endpoints": [ { "id": "5vbuwntrbzil1", "name": "virtual-emr-endpoint-demo", ... "serverUrl": "https://internal-k8s-default-ingress5-4f482e2d41-2097665209.us-east-1.elb.amazonaws.com:18888", # List the deployment
kubectl get deployments -n sparkns -l "emr-containers.amazonaws.com/managed-endpoint-id=5vbuwntrbzil1" NAME READY UP-TO-DATE AVAILABLE AGE
jeg-5vbuwntrbzil1 1/1 1 1 4h54m # List the service
kubectl get svc -n sparkns -l "emr-containers.amazonaws.com/managed-endpoint-id=5vbuwntrbzil1" NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service-5vbuwntrbzil1 NodePort 10.100.172.157 <none> 18888:30091/TCP 4h58m # List the TargetGroups to get the TargetGroup ARN kubectl get targetgroupbinding -n sparkns -o json | jq .items | jq .[].spec.targetGroupARN "arn:aws:elasticloadbalancing:us-east-1:< account id >:targetgroup/k8s-sparkns-servicey-a37caa5e1e/02d10652a64cebd8" # Get the TargetGroup Port number aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:us-east-1:< account id >:targetgroup/k8s-sparkns-servicey-a37caa5e1e/02d10652a64cebd8 | jq .TargetGroups | jq .[].Port 30091 # Get Load Balancer ARN aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:us-east-1:< account id >:targetgroup/k8s-sparkns-servicey-a37caa5e1e/02d10652a64cebd8 | jq .TargetGroups | jq .[].LoadBalancerArns | jq .[] "arn:aws:elasticloadbalancing:us-east-1:< account id >:loadbalancer/app/k8s-sparkns-ingressy-830efa48aa/12199b1a7baee273" # Get Listener Port number aws elbv2 describe-listeners --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:< account id >:loadbalancer/app/k8s-sparkns-ingressy-830efa48aa/12199b1a7baee273 | jq .Listeners | jq .[].Port 18888

To look at how this connects, consider two EMR Studio sessions. The ALB exposes port 18888 to the EMR Studio sessions. The JEG service maps the external port 18888 on the ALB to the dynamic NodePort on the JEG service (in this case, 30091). The JEG service forwards the traffic to the TargetPort 9547, which routes the traffic to the appropriate Spark driver pod. Each notebook session has its own kernel, which has its own respective Spark driver and executor pods, as the following diagram illustrates.

Attach EMR Studio to a virtual cluster and managed endpoint

Each time a user attaches a virtual cluster and a managed endpoint to their Studio Workspace and launches a Spark session, Spark drivers and Spark executors are scheduled. You can see that when you run kubectl to check what pods were launched:

$ kubectl get all -l app=enterprise-gateway
NAME READY STATUS RESTARTS AGE
pod/kb1a317e8-b77b-448c-9b7d-exec-1 1/1 Running 0 2m30s
pod/kb1a317e8-b77b-448c-9b7d-exec-2 1/1 Running 0 2m30s
pod/kb1a317e8-b77b-448c-9b7d-driver 2/2 Running 0 2m38s $ kubectl get pods -n sparkns
NAME READY STATUS RESTARTS AGE
jeg-5vbuwntrbzil1-5fc8469d5f-pfdv9 1/1 Running 0 3d7h
kb1a317e8-b77b-448c-9b7d-exec-1 1/1 Running 0 2m38s
kb1a317e8-b77b-448c-9b7d-exec-2 1/1 Running 0 2m38s
kb1a317e8-b77b-448c-9b7d-driver 2/2 Running 0 2m46s

Each notebook Spark kernel session deploys a driver pod and executor pods that continue running until the kernel session is shut down.

The code in the notebook cells runs in the executor pods that were deployed in the Amazon EKS cluster.

Set up EMR on EKS and EMR Studio

Several steps and pieces are required to set up both EMR on EKS and EMR Studio. Enabling AWS SSO is a prerequisite. You can use the two provided launch scripts in this section or manually deploy it using the steps provided later in this post.

We provide two launch scripts in this post. One is a bash script that uses Formarea AWS Cloud, eksctl, and Interfața liniei de comandă AWS (AWS CLI) commands to provide an end-to-end deployment of a complete solution. The other uses the Kit AWS Cloud Development (AWS CDK) to do so.

The following diagram shows the architecture and components that we deploy.

Cerințe preliminare

Make sure to complete the following prerequisites:

For information about the supported IdPs, see Enable AWS Single Sign-On for Amazon EMR Studio.

Script Bash

Scriptul este disponibil pe GitHub.

Cerințe preliminare

The script requires you to use AWS Cloud9. Follow the instructions in the Amazon EKS Workshop. Make sure to follow these instructions carefully:

After you deploy the AWS Cloud9 desktop, proceed to the next steps.

Pregătire

Use the following code to clone the GitHub repo and prepare the AWS Cloud9 prerequisites:

# Download script from the repository
$ git clone https://github.com/aws-samples/amazon-emr-on-eks-emr-studio.git # Prepare the Cloud9 Desktop pre-requisites
$ cd amazon-emr-on-eks-emr-studio
$ bash ./prepare_cloud9.sh

Implementați stiva

Before running the script, provide the following information:

The AWS account ID and Region, if your AWS Cloud9 desktop isn’t in the same account ID or Region where you want to deploy EMR on EKS
Numele Serviciul Amazon de stocare simplă (Amazon S3) bucket to create
The AWS SSO user to be associated with the EMR Studio session

After the script deploys the stack, the URL to the deployed EMR Studio is displayed:

# Launch the script and follow the instructions to provide user parameters
$ bash ./deploy_eks_cluster_bash.sh ...
Go to https://***. emrstudio-prod.us-east-1.amazonaws.com and login using < SSO user > ...

AWS CDK script

The AWS CDK scripts are available on GitHub. You need to checkout the main branch. The stacks deploy an Amazon EKS cluster and EMR on EKS virtual cluster in a new VPC with private subnets, and optionally an Amazon Managed Apache Airflow (Amazon MWAA) environment and EMR Studio.

Cerințe preliminare

You need the AWS CDK version 1.90.1 or higher. For more information, see Noțiuni introductive cu AWS CDK.

We use a prefix list to restrict access to some resources to network IP ranges that you approve. Create a prefix list dacă nu aveți deja unul.

If you plan to use EMR Studio, you need AWS SSO configured in your account.

Pregătire

After you clone the repository and checkout the main branch, create and activate a new Python virtual environment:

# Clone the repository
$ git clone https://github.com/aws-samples/aws-cdk-for-emr-on-eks.git
$ cd aws-cdk-for-emr-on-eks/
$ git checkout main # $ python3 -m venv .venv
$ source .venv/bin/activate

Now install the Python dependencies:

$ pip install -r requirements.txt

Lastly, bootstrap the AWS CDK:

$ cdk bootstrap aws://<account>/<region> --context prefix=<prefix list> --context instance=m5.xlarge --context username=<SSO user name>

Deploy the stacks

Synthesize the AWS CDK stacks with the following code:

$ cdk synth --context prefix=<prefix list> --context instance=m5.xlarge --context username=<SSO user name>

This command generates four stacks:

emr-eks-cdk – The main stack
mwaa-cdk – Adds Amazon MWAA
studio-cdk – Adds EMR Studio prerequisites
studio-cdk-live – Adds EMR Studio

The following diagram illustrates the resources deployed by the AWS CDK stacks.

Start by deploying the first stack:

$ cdk deploy <stack name> --context prefix=<prefix list> --context instance=m5.xlarge --context username=<SSO user name> emr-eks-cdk

If you want to use Apache Airflow as your orchestrator, deploy that stack:

$ cdk deploy <stack name> --context prefix=<prefix list> --context instance=m5.xlarge --context username=<SSO user name> mwaa-cdk

Deploy the first EMR Studio stack:

$ cdk deploy <stack name> --context prefix=<prefix list> --context instance=m5.xlarge --context username=<SSO user name> studio-cdk

Wait for the managed endpoint to become active. You can check the status by running the following code:

$ aws emr-containers list-managed-endpoints --virtual-cluster-id <cluster ID> | jq '.endpoints[].state'

The virtual cluster ID is available in the AWS CDK output from the emr-eks-cdk stack.

When the endpoint is active, deploy the second EMR Studio stack:

$ cdk deploy <stack name> --context prefix=<prefix list> --context instance=m5.xlarge --context username=<SSO user name> studio-live-cdk

Manual deployment

If you prefer to manually deploy EMR on EKS and EMR Studio, use the steps in this section.

Set up a VPC

If you’re using Amazon EKS v. 1.18, set up a VPC that also has private subnets and appropriately tagged for external load balancers. For tagging, see: Application load balancing on Amazon EKS și Create an EMR Studio service role.

Creați un cluster Amazon EKS

Launch an Amazon EKS cluster with at least one managed node group. For instructions, see Configurare și Getting Started with Amazon EKS.

Create relevant IAM policies, roles, IdP, and SSL/TLS certificate

To create your IAM policies, roles, IdP, and SSL/TLS certificate, complete the following steps:

Enable cluster access for EMR on EKS.
Create an IdP in IAM based on the EKS OIDC provider URL.
Create an SSL/TLS certificate and place it in Manager certificat AWS.
Create the relevant IAM policies and roles:
1. Rolul de executare a jobului
2. Update the trust policy for the job execution role
3. Deploy and create the IAM policy for the AWS Load Balancer Controller
4. EMR Studio service role
5. EMR Studio user role
6. EMR Studio user policies associated with AWS SSO users and groups
Înregistrați clusterul Amazon EKS la Amazon EMR to create the virtual EMR cluster
Create the appropriate grupuri de securitate to be attached to each EMR Studio created:
1. Grup de securitate pentru spațiul de lucru
2. Engine security group
Tag the security groups with the appropriate tags. For instructions, see Create an EMR Studio service role.

Required installs in Amazon EKS

Implementați fișierul Controller AWS Load Balancer in the Amazon EKS cluster if you haven’t already done so.

Create EMR on EKS relevant pieces and map the user to EMR Studio

Urmați pașii următori:

Create at least one EMR virtual cluster associated with the Amazon EKS cluster. For instructions, see Step 1 of Set up Amazon EMR on EKS for EMR Studio.
Create at least one managed endpoint. For instructions, see Step 2 of Set up Amazon EMR on EKS for EMR Studio.
Create at least one EMR Studio; associate the EMR Studio with the private subnets configured with the Amazon EKS cluster. For instructions, see Creați un EMR Studio.
When the EMR Studio is available, map an AWS SSO user or group to the EMR Studio and apply an appropriate IAM policy to that user.

Use EMR Studio

To start using EMR Studio, complete the following steps:

Find the URL for EMR Studio by the studios in a Region:

$ aws emr list-studios --region us-east-1
{ "Studios": [ { "StudioId": "es-XXXXXXXXXXXXXXXXXXXXXX", "Name": "emr_studio_1", "VpcId": "vpc-XXXXXXXXXXXXXXXXXXXX", "Url": "https://es-XXXXXXXXXXXXXXXXXXXXXX.emrstudio-prod.us-east-1.amazonaws.com", "CreationTime": "2021-02-10T14:04:13.672000+00:00" } ]
}

With the listed URL, log in using the AWS SSO username you used earlier.

After authentication, the user is routed to the EMR Studio dashboard.

Alege Creați spațiu de lucru.
Pentru Numele spațiului de lucru, introduceți un nume.
Pentru Subrețea, choose the subnet that corresponds to one of the subnets associated with the managed node group.
Pentru Locația S3, enter an S3 bucket where you can store the notebook content.

After you create the Workspace, choose one that is in the Ready stare.

In the sidebar, choose the EMR cluster icon.
În Tipul clusterului¸ alege Cluster EMR pe EKS.
Choose the available virtual cluster and available managed endpoint.
Alege Atașa.

After it’s attached, EMR Studio displays the kernels available in the Blocnotes și Consoleze secţiune.

Alege PySpark (Kubernetes) to launch a notebook kernel and start a Spark session.

Because the endpoint configuration here uses AWS Glue for its metastore, you can list the databases and tables connected to the AWS Glue Data Catalog. You can use the following example script to test the setup. Modify the script as necessary for the appropriate database and table that you have in your Data Catalog:

words='Welcome to Amazon EMR Studio'.split(' ')
wordRDD = sc.parallelize(words)
wc = wordRDD.map(lambda word: (word, 1)).reduceByKey(lambda a,b: a+b)
print(wc.collect()) # Connect to Glue Catalog
spark.sql("""show databases like '< Database Name >'""").show(truncate=False)
spark.sql("""show tables in < Database Name >""").show(truncate=False)
# Run a simple select
spark.sql("""select * from < Database Name >.< Table Name > limit 10""").show(truncate=False)

A curăța

To avoid incurring future charges, delete the resources launched here by running remove_setup.sh:

# Launch the script
$ bash ./remove_setup.sh</p>

Concluzie

EMR on EKS allows you to run applications on a common pool of resources inside an Amazon EKS cluster without having to provision infrastructure. EMR Studio is a fully managed Jupyter notebook and tool that provisions kernels that run on EMR clusters, including virtual clusters on Amazon EKS. In this post, we described the architecture of how EMR Studio connects with EMR on EKS and provided scripts to automatically deploy all the components to connect the two services.

Dacă aveți întrebări sau sugestii, vă rugăm să lăsați un comentariu.

Despre Autori

Randy DeFauw is a Principal Solutions Architect at Amazon Web Services. He works with the AWS customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.

Matthew Tan este arhitect senior de soluții de analiză la Amazon Web Services și oferă îndrumări clienților care dezvoltă soluții cu servicii AWS Analytics pentru sarcinile lor de analiză.

Source: https://aws.amazon.com/blogs/big-data/configure-amazon-emr-studio-and-amazon-eks-to-run-notebooks-with-amazon-emr-on-eks/

Timestamp-ul: 24 Septembrie, 2021

Timestamp-ul: August 31, 2021

În ianuarie 2022: o experiență de conectare Amazon QuickSight actualizată

Cluster sursă:

AWS

Nodul sursă: 1876560

Timestamp-ul: Septembrie 28, 2021

Automatizați gestionarea utilizatorilor și a grupurilor Amazon QuickSight folosind date LDAP pentru securitate la nivel de rând

Cluster sursă:

AWS

Nodul sursă: 1052621

Timestamp-ul: August 19, 2021

Stabiliți conectivitate privată între Amazon QuickSight și Snowflake folosind AWS PrivateLink

Cluster sursă:

AWS

Nodul sursă: 1858655

Timestamp-ul: Iulie 22, 2021

Configurați Amazon EMR Studio și Amazon EKS pentru a rula notebook-uri cu Amazon EMR pe EKS

Republicat de Platon

Privire de ansamblu asupra arhitecturii

Attach EMR Studio to a virtual cluster and managed endpoint

Set up EMR on EKS and EMR Studio

Cerințe preliminare

Script Bash

Cerințe preliminare

Pregătire

Implementați stiva

AWS CDK script

Cerințe preliminare

Pregătire

Deploy the stacks

Manual deployment

Set up a VPC

Creați un cluster Amazon EKS

Create relevant IAM policies, roles, IdP, and SSL/TLS certificate

Required installs in Amazon EKS

Create EMR on EKS relevant pieces and map the user to EMR Studio

Use EMR Studio

A curăța

Concluzie

Despre Autori

Mai mult de la AWS

Funcții noi de la Apache Hudi 0.7.0 și 0.8.0 disponibile pe Amazon EMR

Interogați o bază de date Teradata utilizând Amazon Athena Federated Query și alăturați-vă cu datele din lacul dvs. de date Amazon S3

Simplificați asimilarea datelor primite cu seturi de date parametrizate dinamic în AWS Glue DataBrew

Începeți cu Amazon Redshift Data API

În ianuarie 2022: o experiență de conectare Amazon QuickSight actualizată

Stabiliți conectivitate privată între Amazon QuickSight și Snowflake folosind AWS PrivateLink

Despre noi

Căutare verticală și Ai

Platformă

Rămâneți conectat

Cont