About the Client
Splurge Art is an innovative platform designed for the emerging AI art community. It allows users to create AI-generated art using simple prompts, compete in daily themes, and exchange virtual coins for real money. The platform integrates social media features, an AI art generation tool, and a vibrant marketplace for art transactions.
Splurge Art’s primary goals include maintaining a 99.9% uptime target, supporting up to 250,000 daily active users, providing 24/7 customer support, and improving software delivery performance as measured by DORA metrics.
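To make the 99.9% uptime target concrete, the short sketch below converts an availability percentage into a downtime allowance. The function name and the 30-day and 365-day periods are illustrative assumptions, not part of the original audit.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime permitted in a period at a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Assumed reporting periods: a 30-day month and a 365-day year.
monthly = allowed_downtime_minutes(99.9, 30)
yearly = allowed_downtime_minutes(99.9, 365)

print(f"{monthly:.1f} minutes per 30-day month")
print(f"{yearly / 60:.2f} hours per 365-day year")
```

At 99.9%, this works out to roughly 43 minutes of downtime per month, or just under 9 hours per year.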
Challenge
Splurge Art faced several infrastructure challenges that necessitated a comprehensive audit:
1. Security Issues:
The platform’s identity and access management (IAM) lacked multi-factor authentication (MFA) for most users, and outdated credentials remained in use without a regular rotation policy. Additionally, security groups and network ACLs were not configured for least-privilege access, posing potential security risks.
2. Cost Management:
Inefficient resource utilization and the absence of a comprehensive tagging strategy for cost allocation led to suboptimal cost management. There was also a lack of configured budgets and billing alerts to prevent cost overruns.
3. Reliability and Performance:
Key services such as RDS and ECS were not fully using multi-AZ deployments, which are critical for high availability and disaster recovery. Auto-scaling was only partially implemented, leading to performance inefficiencies.
4. Data Management:
Limited lifecycle policies and inconsistent backup testing across services indicated potential vulnerabilities in data management practices. Moreover, there was a need to expand backup strategies beyond RDS to include other critical resources.
Solution
To address these challenges, Gart conducted a thorough infrastructure audit and implemented the following solutions:
1. Enhanced Security Measures
- Enabled MFA for all users and reviewed inactive accounts for potential deactivation.
- Implemented regular rotation policies for credentials and enforced least privilege access across IAM policies.
- Reviewed and optimized security groups and network ACLs to ensure they followed the principle of least privilege. Investigated and removed unused security groups.
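The MFA and credential-rotation checks described above can be sketched as a simple audit over exported account data. This is a minimal illustration: the record fields (`name`, `mfa_enabled`, `key_id`, `created`) and the 90-day rotation window are assumptions, not details from the actual engagement.

```python
from datetime import datetime, timedelta, timezone

# Assumed rotation window; the audit's actual policy is not stated in the source.
MAX_KEY_AGE = timedelta(days=90)


def users_without_mfa(users):
    """Return names of users that have no MFA device enabled."""
    return [u["name"] for u in users if not u["mfa_enabled"]]


def stale_keys(keys, now):
    """Return IDs of access keys older than the rotation window."""
    return [k["key_id"] for k in keys if now - k["created"] > MAX_KEY_AGE]


# Hypothetical sample data standing in for an exported credential report.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
users = [
    {"name": "alice", "mfa_enabled": True},
    {"name": "bob", "mfa_enabled": False},
]
keys = [
    {"key_id": "key-old", "created": datetime(2023, 1, 15, tzinfo=timezone.utc)},
    {"key_id": "key-new", "created": datetime(2024, 5, 20, tzinfo=timezone.utc)},
]

print("Users missing MFA:", users_without_mfa(users))
print("Keys due for rotation:", stale_keys(keys, now))
```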
2. Optimized Cost Management
- Conducted a detailed resource utilization review and recommended right-sizing strategies, including the adoption of Reserved Instances or Savings Plans for cost savings.
- Developed and implemented a comprehensive tagging strategy to improve cost allocation and management.
- Configured AWS Budgets and billing alerts to monitor and control expenses effectively.
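A tagging strategy is only useful for cost allocation if it is enforced. The sketch below shows one way such a check might look; the required tag keys and the resource inventory are hypothetical examples, not the tag set Gart defined.

```python
# Hypothetical required tag keys; the actual strategy is not given in the source.
REQUIRED_TAGS = {"project", "environment", "owner"}


def missing_tags(resource_tags):
    """Return the required tag keys absent from one resource, sorted for stable output."""
    return sorted(REQUIRED_TAGS - set(resource_tags))


def untagged_report(resources):
    """Map each non-compliant resource ID to the tag keys it is missing."""
    report = {}
    for resource_id, tags in resources.items():
        missing = missing_tags(tags)
        if missing:
            report[resource_id] = missing
    return report


# Hypothetical inventory: resource ID -> tags currently applied.
resources = {
    "db-main": {"project": "splurge", "environment": "prod", "owner": "platform"},
    "worker-1": {"project": "splurge"},
    "bucket-tmp": {},
}

report = untagged_report(resources)
for resource_id, missing in report.items():
    print(f"{resource_id}: missing {', '.join(missing)}")
```

Resources that appear in the report cannot be attributed to a cost center, so they are natural candidates for follow-up before budgets and alerts are tuned.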
3. Improved Reliability and Performance
- Ensured critical services such as RDS and ECS used multi-AZ deployments to improve availability and disaster recovery.
- Expanded the implementation of auto-scaling to match workload demands more effectively.
- Conducted regular performance reviews to identify and resolve bottlenecks.
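As an illustration of the multi-AZ review, the sketch below flags database instances that run in a single availability zone. It operates on a dictionary shaped like the response of the AWS RDS `describe_db_instances` call, which reports a `MultiAZ` boolean per instance; the sample instances are hypothetical, and fetching the real response is left out.

```python
def single_az_instances(response):
    """Return identifiers of DB instances that do not have Multi-AZ enabled."""
    return [
        db["DBInstanceIdentifier"]
        for db in response.get("DBInstances", [])
        if not db.get("MultiAZ", False)
    ]


# Hypothetical sample shaped like a describe_db_instances response.
sample_response = {
    "DBInstances": [
        {"DBInstanceIdentifier": "orders-db", "MultiAZ": True},
        {"DBInstanceIdentifier": "analytics-db", "MultiAZ": False},
    ]
}

print("Single-AZ instances:", single_az_instances(sample_response))
```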
4. Enhanced Data Management
- Broadened the application of lifecycle policies to include all relevant S3 buckets.
- Expanded regular backup testing to include additional resources beyond RDS, ensuring comprehensive data protection.
- Implemented Infrastructure as Code (IaC) practices with Terraform, including clear code organization, documentation, and integration with CI/CD pipelines for consistent and efficient infrastructure management.
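The S3 lifecycle policies mentioned in this list can be expressed as a small rule document. The sketch below builds one in the structure that the S3 lifecycle configuration API accepts; the rule ID, prefix, and day counts are assumed values for illustration, and in the Terraform-based setup described above the equivalent rule would be declared in the infrastructure code rather than in Python.

```python
def lifecycle_rule(rule_id, prefix, ia_after_days, expire_after_days):
    """Build one lifecycle rule: move objects to infrequent access, then expire them."""
    if ia_after_days >= expire_after_days:
        raise ValueError("transition must happen before expiration")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": ia_after_days, "StorageClass": "STANDARD_IA"}],
        "Expiration": {"Days": expire_after_days},
    }


# Assumed values: archive generated images after 30 days, delete after one year.
config = {"Rules": [lifecycle_rule("archive-generated-art", "generated/", 30, 365)]}

print(config)
```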