I have a DynamoDB table that is automatically exported to S3 as JSON so I can compute some reports. I wanted to automate the table creation and load steps. Amazon Athena can query raw files stored on S3, which makes it possible to run reports when a full database would be too expensive because the reports are only needed a small percentage of the time, or a full database simply isn't required. Athena is billed by the amount of data scanned, which makes it relatively cheap for my use case.
I initially used AWS Glue to create the underlying export table schema. I then copied the schema definition it generated and edited it into a format string that takes a few parameters.
One might think it's possible to SQL inject this, and one would be right if there were any user-accessible inputs, but all inputs come from trusted upstream sources, like S3 events and information derived from the DynamoDB export metadata.
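As a rough illustration (the real schema isn't reproduced here), the format string might look something like this; the table name, S3 location, and the attributes inside the Item struct are placeholders:

```python
# A hedged sketch of the DDL format string, not the actual export schema.
# {table_name}, {export_bucket}, and {export_prefix} are filled in from the
# trusted inputs mentioned above.
CREATE_EXPORT_TABLE_SQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS {table_name} (
  Item struct<
    pk:struct<S:string>,
    sk:struct<S:string>,
    amount:struct<N:string>
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://{export_bucket}/{export_prefix}/data/'
"""
```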
The next step is to use the start_query_execution call from boto3 to send the various queries, and get_query_execution to check on the status of the execution. It's important to note that start_query_execution is an asynchronous API, so we need to poll to determine when the query has completed. Generally a DDL query is relatively quick (typically 1 to 2 seconds), so it's okay to poll inside of a lambda function.
We are going to create two tables. One will be the actual export table over the raw items, and the other will be a view with some cleaned-up results.
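A minimal sketch of the query helper, assuming the workgroup, result location, and database names below are placeholders rather than the values from my stack:

```python
import time

import boto3

athena = boto3.client("athena")

# The workgroup, output location, and database are placeholders for
# illustration; substitute the values from your own stack.
WORKGROUP = "primary"
OUTPUT_LOCATION = "s3://my-athena-results/"


def run_query(sql, database):
    """Issue a single Athena query and poll until it finishes."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=WORKGROUP,
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    execution_id = execution["QueryExecutionId"]

    while True:
        response = athena.get_query_execution(QueryExecutionId=execution_id)
        status = response["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            return execution_id
        if state in ("FAILED", "CANCELLED"):
            # Raising here makes the lambda invocation fail, so the error
            # shows up in CloudWatch where alarms can fire.
            raise RuntimeError(
                f"Athena query {state}: {status.get('StateChangeReason', '')}"
            )
        # DDL queries usually finish in a second or two, so a short poll
        # interval inside the lambda is fine.
        time.sleep(0.5)
```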
The above code issues a single Athena query and checks the status of the execution with a short wait between polls. It raises an exception so that the lambda function fails and the errors show up in CloudWatch, where various alarms can fire. You'll also notice the lack of a proper lambda handler, because it's tailored to my use case. Yours will probably look a bit different from mine, as I use S3 events from the DynamoDB export to trigger the view updates.
The next step is to configure the CloudFormation that creates the lambda function and grants the proper permissions. The permissions for Athena are a bit weird, as we need to touch a lot of different services. We need to allow GETs and PUTs into the results location in S3, and a couple of Glue actions must also be named. I like to declare these permissions inline in my CloudFormation, instead of using a full-blown policy object.
All permissions were determined through trial and error to get it to work. You'll notice that the ARNs have been masked to prevent leaking sensitive information, and there are a couple of wildcards. The masked values are things like AWS account IDs and where you want the Athena results to live. It's important to note that if the underlying export tables need to be protected, the IAM policy statements should be split into create and update portions, although if your view does not already exist, you may need to include the create permission in the update portion.
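For reference, here is a hedged sketch of what the role and inline policy might look like; the account ID, bucket, and Glue database/table names are placeholders, and the exact Glue actions you need may differ:

```yaml
# A sketch, not the actual template. Statements for CloudWatch Logs and the
# export data bucket are omitted.
ViewUpdateFunctionRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
          Action: sts:AssumeRole
    Policies:
      - PolicyName: athena-view-update
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            # GETs and PUTs into the Athena results location.
            - Effect: Allow
              Action:
                - s3:GetObject
                - s3:PutObject
              Resource: arn:aws:s3:::my-athena-results/*
            # The named Glue actions Athena uses to resolve and update the
            # table and view definitions.
            - Effect: Allow
              Action:
                - glue:GetDatabase
                - glue:GetTable
                - glue:CreateTable
                - glue:UpdateTable
              Resource:
                - arn:aws:glue:*:123456789012:catalog
                - arn:aws:glue:*:123456789012:database/my_reports
                - arn:aws:glue:*:123456789012:table/my_reports/*
```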
The ARN of an Athena workgroup that will process the queries must be specified for athena:StartQueryExecution, otherwise the start_query_execution calls will not work. The region part of the ARN must be a wildcard. The s3:GetBucketLocation permission is required for Athena to determine which region a given bucket lives in and how it wants to get the required files. That permission needs to be in a separate IAM block because it's allowed to operate on any S3 resource, and might need to. To lock it down, just specify a specific S3 bucket as the resource.
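In sketch form, those two requirements translate into statements like the following, again with a placeholder account ID and workgroup name:

```yaml
# Additional statements for the inline policy sketched above; placeholders
# throughout.
- Effect: Allow
  Action:
    - athena:StartQueryExecution
    - athena:GetQueryExecution
  # The region portion of the workgroup ARN is a wildcard.
  Resource: arn:aws:athena:*:123456789012:workgroup/primary
# GetBucketLocation gets its own statement because it may be called against
# any bucket; narrow the resource to a specific bucket to lock it down.
- Effect: Allow
  Action: s3:GetBucketLocation
  Resource: arn:aws:s3:::*
```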
I hope you found this technical tutorial helpful. These are mostly my notes on how to make this work. One might need to go through additional steps depending on whether the account has various permissions and restrictions configured.
UPDATE Feb 9, 2021: Not all required permissions were originally listed. The post has been updated with the additional permissions and details on the required resources.