AWS Cost Optimization and CI/CD Automation for Entertainment Software Platform

Client Background

Our client is a platform that connects artists from all over the globe with music curators by promoting songs and artists. At the same time, it allows curators to find talented artists and earn rewards by exploring new music.

Business Challenge

The main business challenges were as follows:

1. Lack of transparency of processes in the team

The team had difficulty maintaining visibility into the project, which made it harder to make accurate decisions.

2. Unstable product with frequent downtime

The application experienced frequent downtime, which inconvenienced users and degraded the overall user experience.

3. Postponed release schedule

The release schedule of the application had to be postponed multiple times for reasons such as unforeseen technical issues, resource constraints, or changing business requirements. These delays slowed the delivery of new features and made it harder to meet user expectations.

4. High and unoptimized infrastructure costs

The infrastructure costs associated with hosting and managing the application were high and not optimized. Inefficient resource allocation and improper scaling mechanisms led to unnecessary expenses, impacting the overall profitability of the project.

5. Inefficient allocation of project maintenance resources

There were challenges in efficiently allocating and managing project maintenance resources to ensure the right resources were available at the right time.

6. Non-scalable application

The infrastructure architecture was not designed to scale up or down, which could affect platform performance, availability, and self-healing.

Regarding the technology stack, the main challenges were as follows:

The application was developed with Node.js (Express), an open-source framework for server-side and mobile API applications. Node.js provides high performance for data-intensive applications that require real-time processing capabilities.

The application consists of 2 main parts: the backend and a cron-server that runs periodic and routine background tasks. The backend and cron-server are hosted and deployed on AWS (Amazon Web Services).

The front end of the application is hosted as static content on AWS S3. This approach leverages the simplicity and cost-effectiveness of hosting static content on a scalable and globally distributed storage service like AWS S3.

Prior to collaborating with the Gart team of DevOps experts, the client’s solution was hosted on multiple EC2 machines and managed manually. The client had already chosen AWS as their hosting platform, and the Gart team stepped in to optimize and improve the deployment and management processes.

The manual deployment process involved developers accessing the target server via SSH, fetching a branch from the repository, and triggering the build. In addition, the development and production Amazon EC2 machines were placed in the same account and the same virtual private cloud (VPC), so the two environments were not isolated from each other. This setup posed security and stability risks, as changes made in the development environment had an impact on the production environment.

Our customer had 4 different Amazon EC2 instances; each required the manual deployment procedure described earlier. This distributed workload increased complexity, maintenance efforts, and the costs of managing multiple server instances. Consolidating the deployment process and reducing the number of server instances would help streamline operations and reduce overhead.

The application was not containerized, meaning it lacked the isolation that containerization technologies like Docker provide. This absence of isolation between the development environment and production further increased the risk of configuration issues and compatibility problems when deploying the application. Containerization would offer improved deployment consistency, scalability, and ease of management.

Our customer’s project solution architecture:

Solution

Prior to partnering with Gart, our customer defined the following acceptance criteria for the services provided:

A) Introduce 4 types of environments:

Production environment – the same account as it is now, same VPC – don’t move production from the existing account
Create separate testing stage and dev environments – split production and non-production accounts, and place workloads there. Spinning up a new development environment in AWS (including a new DB instance) should be reasonably quick.
Local dev environment (create a local sandbox using, e.g., docker-compose with all workloads – UI, backend, database, cron server). Cron servers on the old infrastructure should be switched off on migration to avoid duplicate job execution.
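
As an illustration of the local sandbox, the four workloads named above (UI, backend, database, cron server) could be wired together roughly as follows. This is only a sketch: the service names, build paths, ports, database image, and environment variable are placeholders rather than the project's real values, and the structure is printed as JSON purely to stay within the Python standard library.

```python
import json


def build_local_sandbox() -> dict:
    """Return a Compose-style definition of the four local workloads.

    Every name, path, port, and image below is a hypothetical placeholder.
    """
    db_env = {"DATABASE_HOST": "database"}  # hypothetical variable name
    return {
        "services": {
            "database": {
                # Assumption: the real database engine is not stated in the source.
                "image": "example/database:local",
            },
            "backend": {
                "build": "./backend",
                "ports": ["3000:3000"],
                "environment": db_env,
                "depends_on": ["database"],
            },
            "cron": {
                "build": "./cron-server",
                "environment": db_env,
                "depends_on": ["database"],
            },
            "ui": {
                "build": "./ui",
                "ports": ["8080:80"],
                "depends_on": ["backend"],
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_local_sandbox(), indent=2))
```

Keeping the cron server as its own service mirrors the split between the backend and the cron-server, and makes it easy to stop scheduled jobs locally without touching the API.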

B) Create containers for:

Frontend application (React)
Backend application (Node.js)
Cron server application (Node.js)
NEST application (NEST.js)
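
For the four containers listed above, the build-and-deploy part of a pipeline could be sketched as below, assuming they run as AWS ECS services as the pipeline criterion calls for. The registry address, cluster name, image tag, service names, and build paths are all hypothetical. The sketch also assumes each ECS task definition already references the pushed image tag, so that forcing a new deployment is enough to start fresh tasks.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str     # hypothetical image and ECS service name
    context: str  # hypothetical path to the Docker build context


# The four containers from the list above; names and paths are placeholders.
WORKLOADS = [
    Workload("frontend", "./frontend"),
    Workload("backend", "./backend"),
    Workload("cron-server", "./cron-server"),
    Workload("nest-app", "./nest-app"),
]


def deploy_commands(workload: Workload, registry: str, cluster: str, tag: str) -> list:
    """Return the shell commands (as argument lists) to ship one workload."""
    image = f"{registry}/{workload.name}:{tag}"
    return [
        ["docker", "build", "-t", image, workload.context],
        ["docker", "push", image],
        # Assumes the task definition already points at this image tag.
        ["aws", "ecs", "update-service",
         "--cluster", cluster,
         "--service", workload.name,
         "--force-new-deployment"],
    ]


if __name__ == "__main__":
    for workload in WORKLOADS:
        for command in deploy_commands(workload, "registry.example.com", "dev-cluster", "dev"):
            print(shlex.join(command))
```

Generating the commands from one list of workloads keeps the four pipelines identical in shape, which is the main point of replacing per-server manual deployments.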

C) Introduce CI/CD pipelines for 4 containers (AWS ECS) for workloads
D) Introduce the job to run integration and UI tests
E) Introduce roles and policies for Developer and DevOps + viewer role
F) Introduce metrics and alerts for a cluster:

Memory utilization (with alert)
CPU utilization (with alert)
Disk space utilization (with alert)
Restarted containers (with alerts)
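
The alerting rules behind these four metrics can be pictured with a small threshold check. This is a minimal sketch, not the monitoring setup that was delivered: the metric keys and the threshold values are assumptions, since the criteria name the metrics but not their limits.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str        # hypothetical key used in a metrics sample
    threshold: float   # alert when the sampled value is strictly above this
    description: str


# Hypothetical thresholds; the source lists the metrics but not the limits.
RULES = [
    AlertRule("memory_utilization_pct", 80.0, "Memory utilization"),
    AlertRule("cpu_utilization_pct", 80.0, "CPU utilization"),
    AlertRule("disk_utilization_pct", 85.0, "Disk space utilization"),
    AlertRule("container_restarts", 0.0, "Restarted containers"),
]


def triggered_alerts(sample: dict[str, float]) -> list[str]:
    """Return one message for every rule whose threshold the sample exceeds."""
    alerts = []
    for rule in RULES:
        value = sample.get(rule.metric)
        # Metrics missing from the sample are skipped rather than treated as zero.
        if value is not None and value > rule.threshold:
            alerts.append(f"{rule.description}: {value} exceeds {rule.threshold}")
    return alerts


if __name__ == "__main__":
    sample = {
        "memory_utilization_pct": 91.0,
        "cpu_utilization_pct": 40.0,
        "disk_utilization_pct": 70.0,
        "container_restarts": 2,
    }
    for message in triggered_alerts(sample):
        print(message)
```

Treating any container restart as alert-worthy (a threshold of zero) is one possible choice; a team might prefer to alert only on repeated restarts within a time window.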

G) Document the production deployment procedure as a guidebook.
H) Document the rollback procedure (in case of failure), which MUST include database restoration to the previous state (if database migration scripts were applied).
I) Keep downtime during migration to the new infrastructure at an acceptable level: up to 1 hour.
J) Old servers should be switched off once the workloads have been migrated to the new infrastructure (switch from EC2 IP addresses to the ECS load balancer).
K) The QA team will accept the quality of the environment before making it publicly accessible.
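
The conditional part of criterion H, restoring the database only when migration scripts were applied, can be expressed as a short decision sketch. The step wording, the ordering, and the version label are assumptions made for illustration; the documented procedure itself is not reproduced here.

```python
def rollback_steps(migrations_applied: bool, previous_version: str) -> list:
    """Return an ordered list of recovery steps for a failed release.

    The steps and their order are hypothetical; they only illustrate that
    database restoration is required when migration scripts were applied.
    """
    steps = ["Take the failed release out of service"]
    if migrations_applied:
        # Restore first so the previous code runs against the previous schema.
        steps.append("Restore the database from the backup taken before the release")
    steps.append(f"Redeploy the previous application version ({previous_version})")
    steps.append("Verify the environment before reopening it to users")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(rollback_steps(True, "previous-release"), start=1):
        print(f"{number}. {step}")
```

Writing the procedure down in this form makes it clear that a backup must exist before any migration script runs, otherwise the restoration step cannot be carried out.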