Lessons Learned: Building a Customized Server-less AWS Grafana Dashboard

My name is Adam Norris, and I’m a Senior Solutions Architect at RIVA. We recently kicked off a new project with the United States Army, who challenged us to design a robust and visually pleasing way to display metrics and to migrate them away from manual Excel spreadsheets. We love a good challenge, so we set out to transform how contract deliverables are gathered and presented. In this blog, I detail the approach RIVA took to create a server-less backend and a streamlined solution for gathering metrics and displaying them on a Grafana dashboard.

This story is not about how "it worked the first time because I knew exactly what I was doing." It is more of a "wonderful journey of trial and error" combined with a healthy dose of reminding myself to "not let perfect be the enemy of good." We learned a lot in the initial prototyping of this project and I’m excited to say, this powerful insight dashboard will one day become a new offering available to anyone looking to break free from Excel spreadsheets. I hope you find these lessons learned and insights valuable.

Setting Expectations and Getting Started

I held myself to the high standard of building the solution server-less on AWS. The easy way would have been a simple EC2 server, sitting out in the distance doing things, but the benefits of near-limitless scaling and not having to manage server infrastructure, amongst other things, make the challenge worth it. On top of the server-less challenge, I upped the ante by creating the infrastructure in Terraform and writing all Lambda code in Python.

From the outset, it was business as usual in my Terraform scripts. They started out with the standard VPC, subnet, S3 bucket (random name of course), and IAM. From there, I created the Docker image files on my local WSL instance, using Docker Desktop, to configure and perfect the images as much as I could before uploading them to ECR within a sandbox AWS account. Grafana and InfluxDB were not new to me; I use them personally to monitor my Wi-Fi statistics and all of the Raspberry Pi setups in my home. I created each image definition in its own Dockerfile initially, then refactored them into Docker Compose.

Tackling New Challenges

Once all of the straightforward, familiar stuff was out of the way, it was on to new challenges. While I had used ECR and ECS in a previous life, I had never created and configured them from a blank slate. Then, there was the added complexity of writing Lambda functions in Python.

I started by creating the Docker repository in ECR to house a Grafana and InfluxDB instance using the Docker images I baked and tested on my local machine. It worked out of the box once I authenticated from the AWS CLI command line. I made the conscious choice to use my local machine as the Docker push source so I could have more positive control over the scripting necessary to build and push.
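For readers who would rather script the ECR authentication step in Python than use the AWS CLI, here is a minimal sketch. It is not what I ran for this project. It assumes boto3 and Docker are installed locally and that AWS credentials are already configured; the function names are hypothetical.

```python
import base64
import subprocess


def decode_ecr_token(token: str) -> tuple[str, str]:
    """Split a base64 ECR authorization token into (username, password)."""
    # The decoded token has the form "AWS:<password>".
    username, password = base64.b64decode(token).decode("utf-8").split(":", 1)
    return username, password


def docker_login_to_ecr(region: str) -> None:
    """Log the local Docker client in to the account's ECR registry."""
    import boto3  # assumption: installed on the machine doing the push

    auth = boto3.client("ecr", region_name=region).get_authorization_token()
    data = auth["authorizationData"][0]
    username, password = decode_ecr_token(data["authorizationToken"])
    # Pass the password on stdin so it does not appear in the process list.
    subprocess.run(
        ["docker", "login", "--username", username,
         "--password-stdin", data["proxyEndpoint"]],
        input=password.encode("utf-8"),
        check=True,
    )
```

After a successful login, the usual docker tag and docker push commands against the repository URI should work until the token expires.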

After these early victories, I moved on to create the ECS cluster to run the tasks. This is where I encountered my first major challenge. Since I had not created this before, I used the tried-and-true method (at least for me) of getting each facet of it working in the AWS console first. Comparing the attributes the AWS UI expects with the Terraform documentation gave me some level of parity between the two. After creating the service or sub-service I needed and configuring it to work as desired, I would go through many iterations of code, terraform destroy, terraform apply, (why isn't this working?), configure again, lather, rinse, and repeat. I reached the point where all the services and components were being generated, but ECS was not recognizing the EC2 instances in the scaling group as valid hosts for containers.

Diagnosing the Issue

I was able to diagnose the issue with a combination of AWS logging and actually SSHing into the EC2 instance and looking at the logs. Once I diagnosed the problem as a networking issue, I reverted the networking to use the default VPC. Finally, success! I now had an ECS cluster up and running with EC2 in an auto-scaling group using Terraform. After that, I created the task definitions using the Docker images I pushed to ECR. At that point, everything was finally running, and I could hit the Grafana instance on port 3000 and log in as admin using the password I defined during image creation.

With all infrastructure complete, I moved on to the Lambda and all necessary events to actually process the CSV. The general concept I used was the creation of an event on an S3 bucket to invoke a Python Lambda script for importing the file. This is where I learned about Lambda Layers and all the nuances associated with them. Each of my references, including pandas and requests, also needed to be part of the Terraform automation, and they needed to be packaged in the correct folder structure. After some research, I figured out I could go to pypi.org to download the necessary libraries. I would then extract, combine, and re-package them. All new packaging was added as zip files to the ever-growing Terraform scripts and payload being created.
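To make the folder structure point concrete: for Python runtimes, Lambda looks for layer libraries under a top-level python/ folder inside the layer zip. The sketch below re-packages a directory of already-extracted libraries that way. The function name and paths are hypothetical, and it assumes the zip is written somewhere outside the source directory.

```python
import zipfile
from pathlib import Path


def build_layer_zip(packages_dir: str, zip_path: str) -> list[str]:
    """Zip everything under packages_dir beneath a top-level python/ folder.

    Returns the archive member names that were written.
    """
    root = Path(packages_dir)
    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # e.g. requests/__init__.py -> python/requests/__init__.py
                arcname = (Path("python") / path.relative_to(root)).as_posix()
                archive.write(path, arcname)
                written.append(arcname)
    return written
```

One caveat worth keeping in mind: libraries with compiled extensions, such as pandas, need builds that match the Lambda runtime's Linux platform and Python version, so files copied from a Windows or macOS install will generally not import.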

Once all of my Python reference libraries were in place within the Terraform script, I created the initial script to read the CSV from the S3 bucket, along with all the events required to read and process the file when it was introduced into the bucket. When the script finished a run, it sent a notification to a Slack channel to alert us that an event had happened and to report how many entries were processed into the database. This script evolved from pushing data with a simple HTTP POST for each entry (which was, of course, very slow), to a multi-part POST for the entire payload of data, to using the actual InfluxDB libraries.
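The flow described above can be sketched roughly as follows. This is not the project's actual script: the measurement name deliverables and the team and count columns are made up for illustration, the rows carry no timestamp (so InfluxDB would assign the write time), and the final write is left as a comment because it depends on which InfluxDB client and version you use. The batching step builds one InfluxDB line protocol payload per file instead of one request per row.

```python
import csv
import io
from urllib.parse import unquote_plus


def s3_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in the event (spaces become '+').
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def _escape(value: str) -> str:
    """Escape commas, equals signs, and spaces for line protocol keys and tags."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def csv_to_line_protocol(text, measurement, tag_cols, field_cols):
    """Convert CSV text into a list of line protocol rows (numeric fields only)."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        tags = "".join(f",{_escape(c)}={_escape(row[c])}" for c in tag_cols)
        fields = ",".join(f"{_escape(c)}={float(row[c])}" for c in field_cols)
        lines.append(f"{measurement}{tags} {fields}")
    return lines


def handler(event, context):
    """Lambda entry point: read each uploaded CSV and build one batch per file."""
    import boto3  # available in the AWS Lambda Python runtime

    s3 = boto3.client("s3")
    processed = 0
    for bucket, key in s3_objects(event):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        lines = csv_to_line_protocol(body, "deliverables", ["team"], ["count"])
        payload = "\n".join(lines)
        # Hand `payload` to your InfluxDB client (or write endpoint) in one call here.
        processed += len(lines)
    return {"processed": processed}
```

The returned count is the kind of number you could drop into the Slack notification.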

A Proof of Concept is Born

As mentioned in the opening, this is not an outline of a first-time successful effort; there was a lot of learning, trial, and error. At RIVA, we are committed to delivering innovative solutions to our customers, which is how this customized approach was born. In the end, we created a capability to consume a CSV in a known format into an InfluxDB instance using AWS server-less infrastructure, for eventual display in the provided Grafana instance. It is now available for the Army to better visualize how their statistics are affecting their business processes. This project is just beginning. Now that the proof of concept has been completed and successfully demonstrated, I intend to extend the capabilities to include:

An AWS API Gateway endpoint to allow for full file upload, RPA integration, and better security
A full Grafana provision (e.g., users, teams, dashboards, data sources) using the Terraform provider
A fully immutable environment using a GitOps framework through a robust CI/CD pipeline

I hope you found these insights valuable. Please reach out if you have questions or want to chat about how RIVA can deploy this technology at your organization.
