Found a bug while migrating Redis from AWS ElastiCache to Redis Labs
We run our microservices on AWS, and our stack relies on caching to give our clients better performance. For caching we used AWS ElastiCache for Redis. It served our use cases well until the day our services slowed down and some requests started timing out. We got an alert late at night and scaled out the Redis cluster. Problem solved, but it needed our intervention.
If you prefer a video explanation, you can watch it on YouTube.
AWS ElastiCache provides scale-out and scale-in, clustering, and security, and it is fully managed. But auto scaling was missing from the list of features. We didn't want our clients to face this problem again, and we didn't want our team woken up at midnight. So we started looking for alternatives that offer auto scaling for Redis, and we found one: Redis Labs.
We gave them a list of our microservices and the memory each one required. They (Redis Labs) created a Redis host, opened a port and created a password. Now it was up to us to migrate from AWS ElastiCache to Redis Labs. We wanted to watch its performance and run sanity checks across the API catalog for a week to make sure everything that already worked kept working and performing.
Now comes the fun part, where we struggled to find the cause! It was weird, and we had never thought of it. It came out of nowhere and taught us a lesson.
Our tech stack: the repository was on GitLab, services were hosted on AWS EC2, the database was AWS DynamoDB, the code was written in Node.js, and we used Terraform and GitLab CI/CD for deployment.
We were using the npm package ioredis to connect to Redis. We changed our code to include the password (earlier, ElastiCache required no password to connect because of the policy attached on AWS). It worked on our local machine, and we saw keys being created on Redis Labs. So we were set to deploy to our staging server and test with production data. We were not allowed to push the changed code to the staging branch, because we didn't want to have to revert it if something went wrong. So we created a branch from staging, changed the gitlab.yml file so that this branch would deploy to the staging server, and pushed it.
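The password change can be sketched roughly like this. It is a minimal illustration, not our exact code: the variable names REDIS_HOST, REDIS_PORT and REDIS_PASSWORD and both helper functions are assumptions made for the example. The host, port and password options are standard ioredis connection options.

```javascript
// Sketch: build ioredis connection options from environment variables.
// REDIS_HOST, REDIS_PORT and REDIS_PASSWORD are hypothetical names.
function buildRedisOptions(env = process.env) {
  const options = {
    host: env.REDIS_HOST,
    // Fall back to the default Redis port when none is configured.
    port: Number(env.REDIS_PORT) || 6379,
  };
  // Only send a password when one is configured, so the same code
  // works against a host that needs no password.
  if (env.REDIS_PASSWORD) {
    options.password = env.REDIS_PASSWORD;
  }
  return options;
}

function createRedisClient(env = process.env) {
  // Loaded lazily so buildRedisOptions can be used (and tested)
  // on a machine where ioredis is not installed.
  const Redis = require('ioredis');
  return new Redis(buildRedisOptions(env));
}
```

Keeping the options in a small pure function makes it easy to check locally what will be sent to the Redis host before anything is deployed.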
The pipeline for the branch ran and the service got deployed. The service was not stable, though; it kept spinning. We checked the logs for a Redis connection issue. The service was able to connect to the Redis host, but there were also logs of a DynamoDB error:
"tableName" is not allowed to be empty
at /usr/src/app/node_modules/dynamodb/lib/schema.js:91:13
at internals.Object._validateWithOptions (/usr/src/app/node_modules/dynamodb/node_modules/joi/lib/types/any/index.js:654:20)
We debugged and couldn't find the cause of this error. We pointed the Redis connection back to AWS ElastiCache and deployed to staging again. We saw the same logs, and the service was still spinning, so the Redis change itself was not the culprit. After a couple of hours of struggling, we started reading the error logs line by line. We found a model named 'custom' in the log: Object.<anonymous> (/usr/src/app/src/models/custom.js:10:3)
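Reading a long stack trace line by line is slow. A small helper like the one below could narrow it down to the frames that point at application code. This is only a sketch: appFrames is a hypothetical helper, and it assumes that dependency frames contain /node_modules/ in their path, as they do in the log above.

```javascript
// Return only the stack frames that point at application code:
// lines that start with "at" and do not pass through node_modules
// or Node.js internals.
function appFrames(stack) {
  return stack
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.startsWith('at '))
    .filter((line) => !line.includes('/node_modules/'))
    .filter((line) => !line.includes('node:internal'));
}

// Example input assembled from the log lines quoted in this post.
const sampleStack = [
  'Error: "tableName" is not allowed to be empty',
  '    at /usr/src/app/node_modules/dynamodb/lib/schema.js:91:13',
  '    at Object.<anonymous> (/usr/src/app/src/models/custom.js:10:3)',
].join('\n');

console.log(appFrames(sampleStack));
```

For this sample, the only remaining frame is the one in src/models/custom.js, which is exactly the line that put us on the right track.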
We checked the 'custom' model file and saw that the table name came from an environment variable. On our local machine it worked because the table name was present in the .env file. But it should have been present on the staging server too (we had deployed there just the day before). What could be the reason?
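In hindsight, a fail-fast check on the environment variable would have produced a much clearer message than the validation error from deep inside the dynamodb package. A minimal sketch, assuming a hypothetical helper requireEnv and a hypothetical variable name CUSTOM_TABLE_NAME (our real variable name is not shown here):

```javascript
// Read a required environment variable or fail with a message that
// names the variable, instead of passing an empty value downstream.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Hypothetical usage in a model file:
// const tableName = requireEnv('CUSTOM_TABLE_NAME');
```

With a check like this, the service would still fail to start, but the first line of the log would name the missing variable.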
Our debugging led us to the GitLab CI/CD variables, where all application variables are stored. The variable holding the table name (the one used in our code) was there. But it was the only variable marked as protected, which means it is exported only to pipelines running on protected branches or tags. And our branch was not protected (we had only created it from a protected branch).
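A preflight step at the start of the pipeline could catch this situation before the deploy. The sketch below is an assumption about how such a check might look, not something we had in place: preflight is a hypothetical function, and the variable names passed to it are up to you. CI_COMMIT_REF_PROTECTED and CI_COMMIT_REF_NAME are GitLab predefined variables.

```javascript
// Return a list of human-readable problems; the list is empty when
// every required variable is set.
function preflight(requiredNames, env = process.env) {
  const missing = requiredNames.filter((name) => !env[name]);
  if (missing.length === 0) {
    return [];
  }
  const problems = missing.map((name) => `Missing CI/CD variable: ${name}`);
  // GitLab sets CI_COMMIT_REF_PROTECTED to "true" only when the job
  // runs for a protected ref. Anything else is a likely explanation
  // for a variable that exists in the settings but not in the job.
  if (env.CI_COMMIT_REF_PROTECTED !== 'true') {
    const ref = env.CI_COMMIT_REF_NAME || 'unknown';
    problems.push(
      `Ref "${ref}" is not protected, so protected variables are not exposed to this pipeline.`
    );
  }
  return problems;
}

// Hypothetical usage in a pipeline script:
// const problems = preflight(['CUSTOM_TABLE_NAME']);
// if (problems.length > 0) { console.error(problems.join('\n')); process.exit(1); }
```

The hint about the unprotected ref is added only when something is actually missing, so pipelines on ordinary branches that need no protected variables are not affected.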
We marked the branch as protected and deployed again. The Redis migration was successful, and we could see the generated keys from the CLI.
Apart from the protected-variable mess, migrating from AWS ElastiCache to Redis Labs is a 15-minute job (if the CI/CD pipeline is ready and the credentials are in hand).