Here at Alloy, we use AWS Lambda functions to run batch operations and deploy data science models that are available on demand. AWS Lambda is a serverless compute service that allows developers to upload code and dependencies without having to provision or manage servers. Developers can trigger Lambda functions using Amazon CloudWatch or call them from applications.
This blog post is for data science teams looking to deploy their own Lambda functions requiring Pandas. Many good tutorials on using Lambda functions for data science are already out there, but this one focuses in particular on Lambda layers: an AWS tool released at the end of 2018. A layer is a ZIP archive that contains libraries, a custom runtime, or other dependencies and can be configured with your Lambda function.
By deploying the dependencies manually as a Lambda layer instead of bundling them with our code on every deployment, we’ve benefited by:
- Keeping code deployments small and fast,
- Seeing and modifying code in the AWS Lambda console,
- Sharing layers between multiple functions to easily load commonly used dependencies, and
- Saving DevOps time, since we haven’t needed an EC2 instance to compile Pandas in an AWS Linux environment.
This post will explain how to manually deploy a Lambda layer which will contain Pandas compiled in an AWS Linux environment, along with other dependencies.
Deploying the Lambda Layer
Lambda layers are a feature of AWS Lambda for storing dependencies that Lambda functions can access. Data scientists frequently find themselves importing Pandas and NumPy for data science and modeling work. While AWS provides a Python layer with NumPy and SciPy, it doesn’t include Pandas. Since our function uses Pandas, we could not use the AWS-provided layer.
Much of this post draws inspiration from a similar blog post by Quy Tang. However, instead of using the AWS-provided Python NumPy layer and supplementing it as Quy did, we opted to build all the dependencies together in one layer.
Deploying the layer separately from the code creates efficiencies in keeping actual code deployment small. This allows us to deploy much faster, as we don’t have to upload a ~40MB (before being zipped!) Pandas dependency during each deployment.
As a Senior Data Engineer at Alloy, I’ve worked on scheduling batch jobs in AWS Lambda. Using layers has given me the added bonus of seeing my code in the Lambda console. Seeing the code inline is great for making quick changes to the code while testing through configured test events in the console. After testing in the console, I finalize the code in GitHub.
The downside to deploying code and dependencies separately is that there could be some misalignment — so be careful! To avoid misalignments, we would suggest deploying a layer each time there is a change in requirements.txt and maintaining clear naming conventions for layers.
1. Creating the Requirements File
For this tutorial, we have a basic requirements.txt sitting in our repository, including Pandas, NumPy, and one other dependency.
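As an illustration (the third package and the exact version pins are our assumptions, not prescribed by Lambda), requirements.txt might look like:

```
pandas==0.25.3
numpy==1.17.4
pytz==2019.3
```

Pinning exact versions keeps the layer reproducible: rebuilding it later pulls the same wheels.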
Here blog-lambda-function is one of the many Lambda functions in our repository. blog-lambda-function/requirements.txt is specific to this Lambda function so it will sit in that folder.
2. Creating the get_layer_packages Script
Next we will create get_layer_packages.sh which sits in the same folder as requirements.txt for that specific Lambda function.
The script below will download the dependencies locally for packages commonly used in Python data science. It compiles all of the requirements in an AWS Linux environment into a local folder called python. It’s very important that Pandas is compiled correctly in an AWS Linux environment; otherwise, the Lambda function will fail to load the dependency.
Installing the dependencies under the python folder is required: for Python runtimes, Lambda looks for a layer’s packages in that top-level python folder.
Note: I used Python 3.6, but there are Python 3.7 and 3.8 Docker images as well.
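As a sketch of such a script — assuming Docker is installed locally, and using the community-maintained lambci/lambda build image, which mirrors the AWS Linux environment — get_layer_packages.sh might look like:

```bash
#!/bin/bash
# Compile everything in requirements.txt into a local "python" folder,
# using a Docker image that mirrors the AWS Linux environment Lambda runs in.
export PKG_DIR="python"

# Start from a clean folder so stale packages don't linger between builds.
rm -rf ${PKG_DIR} && mkdir -p ${PKG_DIR}

docker run --rm -v "$(pwd)":/var/task -w /var/task lambci/lambda:build-python3.6 \
    pip install -r requirements.txt -t ${PKG_DIR}
```

The `-t` flag tells pip to install the packages into the python folder instead of the image’s site-packages directory, so they end up on your local disk via the mounted volume.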
3. Create and Zip the Lambda Layer folder
Now we make get_layer_packages.sh executable and run it.
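Assuming the script sits in the current directory, that is:

```bash
chmod +x get_layer_packages.sh
./get_layer_packages.sh
```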
Running this script will create a python folder under the blog-lambda-function folder containing the dependencies compiled in an AWS Linux environment. The other option to get Pandas to work in an AWS Lambda environment is to compile Pandas on an EC2 instance; that would require more DevOps time, so we chose not to go that route.
Now, zip up the python folder, leaving unnecessary files such as requirements.txt out of the archive. We descriptively named this zipped file my-Python36-Layer-Deployment.zip:
This command only selects the contents of the python folder to zip up. Unzipping this file locally should reveal a python folder with properly compiled requirements inside.
The file directory looks like this at the end:
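Assuming the names used in this post, the layout is roughly:

```
blog-lambda-function/
├── get_layer_packages.sh
├── my-Python36-Layer-Deployment.zip
├── python/
│   ├── numpy/
│   ├── pandas/
│   └── ...
└── requirements.txt
```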
4. Send it to AWS Lambda
The last step is to send this up! You can do this after configuring the AWS Command Line Interface (AWS CLI) using this AWS guide.
Running this command in the terminal will send the layer up to AWS (change the compatible runtime and the layer name to fit your function):
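With the AWS CLI configured, publishing looks like this (the layer name and description are our choices; swap in your own, and match --compatible-runtimes to the Python version you built with):

```bash
aws lambda publish-layer-version \
    --layer-name my-Python36-layer \
    --description "Pandas, NumPy, and friends compiled for AWS Linux" \
    --zip-file fileb://my-Python36-Layer-Deployment.zip \
    --compatible-runtimes python3.6
```

The command prints the new layer’s ARN and version number, which you’ll reference when attaching the layer to a function.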
Pick a layer name similar to your function name to find it easily. Note that publishing to the same layer will increment the layer version number by 1.
Lastly, navigate to the “Designer View” to assign the Layer to the function.
In a following post, we’ll describe how to deploy the handler function of a Lambda function using GitHub Actions. Even without GitHub Actions, you can still utilize the Lambda layer with Pandas: manually deploy your Lambda handler function by uploading a ZIP file (directly or from S3), or edit the code inline in the console.