Improve Deployment Infrastructure using 12 Factor: GSoC'23 Final Report

The goal of this blog post is to showcase, in detail, the work that Vaibhav Upreti did on CircuitVerse during Google Summer of Code 2023, which ran from May 29, 2023, to August 28, 2023.

CircuitVerse is a cool open-source platform which allows users to construct digital logic circuits online.

Project Description


The primary objective of my GSoC project was to upgrade CircuitVerse’s deployment infrastructure to meet the 12-factor standards, paving the way for a more efficient, scalable, maintainable, and robust platform. The project involved several important tasks, each contributing to the overall enhancement of the platform.

For a detailed description of the project, refer to the project page.

Accomplishments


Here’s a concise summary of my achievements:

  • For the first time, I made changes that directly affected hundreds of thousands of users through large-scale data migrations.
  • My changes and optimisations resulted in a direct benefit to the organization by reducing infrastructure costs.
  • Successfully applied 12-factor principles, boosting scalability, reliability, and significantly reducing infrastructure costs.
  • Learnt a great deal from my mentor, a senior software engineer, about Ruby, Rails, software development practices and handling applications in production.

1. Make CircuitVerse a 12 Factor Application

I prioritized the implementation of 12 Factor principles throughout the development process.

A key achievement was customizing CircuitVerse’s Docker image for wider usability, reducing memory consumption (by using jemalloc), and reducing Docker image build time.

As suggested by my mentor, I also initialized the CircuitVerse runbooks, which provide comprehensive documentation for production deployment, including all necessary background information.

2. Migrate Assets to AWS S3

Large-Scale Migration: I led the migration of nearly a million assets, including user profile pictures and circuit images, from the old, deprecated setups (CarrierWave, Paperclip) to ActiveStorage, the Rails solution for handling file uploads, backed by AWS S3. This transition not only improved storage efficiency but also set the stage for seamless expansion.

My approach: Ensure zero downtime for users by mirroring uploads to both the new (ActiveStorage) and old (Paperclip, CarrierWave) configurations, followed by data migrations and background jobs to backfill existing data.

Initially, we used the data_migrations approach, maintaining a Redis counter to track progress and adding logging for insight. However, as server traffic grew, memory issues arose, so we moved the work to background jobs running on Sidekiq. For this, we used Shopify’s maintenance_tasks gem, with each job migrating 1000 records.
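The mirror-then-backfill idea can be sketched in a few lines. This is a minimal, in-memory illustration in Python rather than the project's Rails code: `upload`, `batches`, and `backfill` are hypothetical names, plain dictionaries stand in for the old (Paperclip, CarrierWave) and new (ActiveStorage) stores, and the batch size of 1000 simply mirrors the job size mentioned above.

```python
from typing import Dict, Iterator, List

BATCH_SIZE = 1000  # assumed: one background job handles this many records


def upload(key: str, data: bytes,
           old_store: Dict[str, bytes], new_store: Dict[str, bytes]) -> None:
    """Mirror every new upload to both stores so neither falls behind."""
    old_store[key] = data
    new_store[key] = data


def batches(keys: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` keys."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def backfill(old_store: Dict[str, bytes], new_store: Dict[str, bytes],
             size: int = BATCH_SIZE) -> int:
    """Copy records that exist only in the old store; return how many were copied."""
    migrated = 0
    for batch in batches(sorted(old_store), size):
        for key in batch:
            if key not in new_store:  # skip mirrored records, so reruns are safe
                new_store[key] = old_store[key]
                migrated += 1
    return migrated
```

Because the backfill skips anything already present in the new store, it can be rerun after a partial failure without duplicating work, while mirrored uploads keep both stores consistent in the meantime.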

Scalability & Cost Reduction: Migrating to object storage, specifically S3, reduced infrastructure costs compared to EBS and also scales with demand, making it a good fit for storing large volumes of data and accommodating future growth.

3. Improve Observability using OpenTelemetry

I configured distributed tracing with OpenTelemetry for CircuitVerse and exported the telemetry data to Jaeger and New Relic backends. This tracing system provides invaluable insights into our platform’s performance, enabling us to identify bottlenecks and enhance user experiences.

Figure: OpenTelemetry’s architecture and its use in our service.

Figure: Jaeger dashboard.

Figure: New Relic dashboard.

Figure: Inspecting a trace.

4. Zero Downtime Deployment Pipeline with GitHub Actions and Kamal

Successfully set up a Continuous Deployment pipeline that deploys CircuitVerse Docker images to production using GitHub Actions and Kamal with zero downtime.

Kamal uses the dynamic reverse proxy Traefik to hold requests while the new app container is started and the old one is stopped. It works seamlessly across multiple hosts, using SSHKit to execute commands. Originally built for Rails apps, Kamal will work with any type of web app that can be containerized with Docker.

The workflow consists of two jobs:

  1. build-production: This job builds the Docker image and pushes it to the registry for linux/amd64 and linux/arm64 architectures. The build process is optimized using docker buildx caching, significantly reducing build times.

  2. deploy: After the build job completes, the deploy workflow requires a review by a repository committer. Once approved, it sets up Kamal and deploys the latest Docker image tagged with the GitHub SHA hash from the repository’s current origin.

Figure: The deploy job in the GitHub Actions workflow (kamal-job).

As the image above shows, the deploy job has protection rules for the “production” environment in GitHub Actions. When a newer deploy job is enqueued, it cancels the previous workflow to ensure the latest image is deployed.

In the deploy action, Kamal performs several key tasks:

  1. Pulls the image from the registry.
  2. Runs health checks on the servers at the http://localhost:3999/up route.
  3. If the health checks pass, swaps the existing container with the newer version.
  4. If a health check fails, acquires a lock on the deployment to prevent any conflicts or issues during the update process.
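The control flow in these steps can be sketched as follows. This is not Kamal's implementation: `deploy`, `is_healthy`, and the `state` dictionary are hypothetical stand-ins, and the sketch only follows the description above, including holding a lock when the health check fails.

```python
from typing import Callable, Dict


def deploy(state: Dict[str, object], new_image: str,
           is_healthy: Callable[[str], bool]) -> bool:
    """Swap to `new_image` only if its health check passes.

    `state` stands in for a host: {"running": <image name>, "locked": <bool>}.
    Returns True when the swap happened.
    """
    if state["locked"]:
        return False  # an earlier failed deploy has to be resolved first

    # Step 1: "pull" the image (a no-op in this sketch).
    candidate = new_image

    # Step 2: probe the candidate, for example with an HTTP GET on its /up route.
    if is_healthy(candidate):
        # Step 3: replace the old container with the new one.
        state["running"] = candidate
        return True

    # Step 4: keep the old container serving traffic and hold the lock.
    state["locked"] = True
    return False
```

The point of the ordering is that the old container is never stopped before the new one has proven itself healthy, which is what makes the swap zero-downtime from the user's perspective.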

In addition, in the CircuitVerse CI workflows, we build Docker images for each pull request to the master branch, helping developers validate their code for production readiness.

Memory Optimisation: Configured jemalloc for the Docker image, reducing memory fragmentation.

Successfully deployed CircuitVerse to the staging environment.

Figure: CircuitVerse running on staging (cv-staging).

Feedback

Figure: Staging feedback (staging-feedback).

5. Monitoring Server with Monit

I introduced Monit, an open-source server monitoring tool that conducts automatic maintenance and repair and can execute meaningful actions when errors occur.

I added Monit configuration for the following services:

  • Sidekiq
  • Procodile
  • Postgres
  • Redis

Monit promptly restarts services and sends SMTP alerts when a service goes down or reaches its alert limit.

Figure: Monit alerts (monit-alerts).

6. Drop visitor tracking that stores user details and adopt HyperLogLog for project view counts

HyperLogLog is a probabilistic data structure that estimates the cardinality of a set. As a probabilistic data structure, HyperLogLog trades perfect accuracy for efficient space utilization. Thus this algorithm can estimate the number of unique values within a very large dataset using little memory and time.
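To make the accuracy-for-space trade concrete, here is a minimal HyperLogLog sketch in Python. It is an illustration only, not one of the libraries evaluated for CircuitVerse: the class name, the choice of SHA-1 as the hash, and the default of 2^12 registers are all assumptions, and it omits the sparse representation and bias-correction tables that production implementations use.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog with 2**p registers (illustrative, not production-grade)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Derive a 64-bit hash from the first 8 bytes of SHA-1.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)     # constant valid for m >= 128
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)
```

Each register stores only a small integer, so the whole structure stays at a fixed size no matter how many views are added, and adding the same visitor twice never changes the estimate. With m registers the typical relative error is about 1.04/sqrt(m), roughly 1.6% for the 4096 registers assumed here.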

Transition Strategy: I evaluated multiple HLL (HyperLogLog) libraries, prioritizing solutions aligned with ease of setup, precision, and strong community support.

We had three options:

  1. Utilize the postgres-hll extension, incorporating a separate HLL field for projects.
  2. Implement Redis HyperLogLog.
  3. Store HLLs as text in the PostgreSQL database.

Most of the HLL libraries I evaluated were outdated, so the idea of storing HLLs as text in the database was temporarily shelved. Others had external dependencies that could complicate setup for new contributors. Redis HyperLogLog counters appeared viable (GitLab uses HLL counters in a similar way) but would entail higher infrastructure costs. After discussions with my mentor, we decided to move this out of the program’s scope, since it needed further research and carried potential complexities.

Feedback


Figure: Midterm feedback (vu-midterm-feedback).

Pull Requests & Blogs


Repo - CircuitVerse/CircuitVerse

Pull Request | Description
fix: erb tags | Fix for erb tags in the codebase
feat: mirror pfp & projects, backfill profile_pictures | Added a feature to mirror pfp & projects, while simultaneously backfilling profile_pictures
feat: migrate image_preview to AWS S3 | Migration of image_preview to AWS S3 storage
chore: update rails to 7.0.5.1 | Updated Rails version to 7.0.5.1
fix: use env[] instead of fetch | Code fix to use env[] instead of .fetch
feat: make member since field more readable | Added a feature to make the ‘member since’ field more readable
feat: distributed tracing using OpenTelemetry | Implemented distributed tracing using OpenTelemetry
feat: continuous deployment workflow using GitHub Actions and Kamal | Added a continuous deployment workflow using GitHub Actions and Kamal
feat: serve profile_pictures with ActiveStorage | Implemented serving profile pictures with ActiveStorage
chore: disable generating spans for default settings | Disabled generating spans for default settings
fix: commentator profile_picture error | Fixed commentator profile_picture error
chore: rerun image preview migration | Reran the image preview migration
feat: migrate image_preview using Sidekiq | Migrated image_preview using Sidekiq
chore: make maintenance tasks migrations safe | Made maintenance tasks migrations safe
chore: mark maintenance tasks migrations safe | Marked maintenance tasks migrations as safe
feat: deploy CircuitVerse to staging using Kamal | Deployed CircuitVerse to staging using Kamal
feat: Serve assets using active storage | Serve Image Preview using ActiveStorage
feat: production deployment using kamal | Deploy CircuitVerse to production using kamal

Repo - CircuitVerse/infra

Pull Request | Description
feat: monit config files #1 | Added Monit configuration files
feat: Intialise runbook #3 | Initialized CircuitVerse runbooks
docs: distributed tracing using OpenTelemetry #5 | Documented distributed tracing using OpenTelemetry
docs: Kamal documentation #6 | Added Kamal documentation

Blog Posts

I published weekly blog posts throughout this period, which you can read at https://vaibhavupreti.github.io/hugo-blog/tags/gsoc

Featured posts:

What’s Next


I’m excited to continue as a Core Team member, maintaining this incredible open-source project.

Additionally, we plan to use a blue-green deployment approach to roll out the CD pipeline after rigorous testing in the staging environment.

  • Blue - older server
  • Green - current staging environment

This involves copying the latest production data to staging (the latest pg_dump and Redis data). Production traffic will continue on ‘blue’ until we replicate and scale ‘green’ to match or exceed its capacity. Once performance and stability are confirmed, we’ll transition production traffic to ‘green’ (the staging server) and phase out the older ‘blue’ instance, ensuring a risk-minimized transition.

Acknowledgments


I’m grateful to my mentor, Aboobacker M.K, who helped me whenever I faced challenges and never overlooked any part of their mentoring. They taught me a lot about Ruby, Rails, and software development in general. The weekly meetings were exceptionally informative, and I cannot overstate how much I learned through my interactions with my mentor. I doubt I will ever encounter a similar experience. Their dedication motivates me to aspire to become a software engineer like them and to share my learnings with others.
