How I debugged huge Data Transfer costs on AWS
During the month of Aug 2018, our costs on AWS were sky-rocketing:
The projected costs were also increasing every day at a rapid pace. We could’ve hit around $4k by the end of the month. Something was going very wrong. When you are a startup, time becomes the biggest constraint in situations like these, because you don’t have the monetary cushion to fall back on while you debug at your own pace.
This blog explains how I debugged this in under 2 days. It is meant to be a guide of pointers for you, and for future me, if we ever encounter high Data Transfer costs on AWS. Hopefully, it saves us time. Below are the steps I took to debug and finally fix the leakage.
Before I proceed any further, a small primer on our infrastructure:
- 75% of our servers are running on AWS. Remaining 25% on Heroku.
- All the EC2 instances on AWS are spawned using Elastic Beanstalk (simply put, it is AWS’s Heroku, but better).
- All the instances on AWS reside in a single region (Singapore) and a single Availability Zone (ap-southeast-1a).
- ElastiCache is the managed service which hosts our Redis cluster for caching and queuing. Again, the cluster is running in Singapore (ap-southeast-1a).
- The database is MongoDB, hosted on a managed service provider, Atlas. The database cluster is also running in Singapore.
- The codebase mostly runs on the NodeJS runtime.
Step 1: Verify the current month’s cost breakdown, compare it with the previous month’s costs and analyze what went wrong
You can find your bill details in the AWS console under Billing → Bills:
On comparing July’s and August’s cost details, I found that the only major change was in “Data Transfer”. Other services remained consistent with the previous month’s costs. The “Data Transfer” costs had increased 6–7 times, going from around $200 to over $1,000, and the month of Aug hadn’t even ended yet. On exploring further, I found:
My account had a consumption of over 100TB of data under “regional data transfer — in/out/between EC2 AZs or using elastic IPs or ELB” and it had cost over $1k.
Mind = Blown!!!
If it had been EC2 costs or some other transfer costs, I would have been able to debug it faster because I have experience with those services and usually have an idea of what might be going wrong. Data Transfer costs were new to me. I needed more information to decide how to proceed.
Step 2: Use AWS Cost Explorer to get more information
You can find Cost Explorer in the billing dashboard itself. This was the first time I was using Cost Explorer, so I played around with it for a bit, mainly with the filters, to understand how the graphs work. I used the “Usage Type Group” filter and a process of elimination to rule out the services that weren’t driving the costs. The only usage type whose cost had increased was “EC2: Data Transfer — Inter AZ”. Graph:
The cost had gone up from $3-$4 per day to $60-$70 per day from Aug 1.
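For reference, the same breakdown can also be pulled programmatically through the Cost Explorer API. Below is a minimal sketch, assuming the aws-sdk v2 package for Node and Cost Explorer access on the account; the dates and the $10 threshold are illustrative, and this is not the exact query I ran (I used the console).

```js
// Sketch: pull daily unblended costs grouped by usage type group, so an
// expensive line like "EC2: Data Transfer - Inter AZ" can be spotted quickly.
const AWS = require('aws-sdk');

// The Cost Explorer API is only served out of us-east-1.
const ce = new AWS.CostExplorer({ region: 'us-east-1' });

async function dailyCostsByUsageTypeGroup() {
  const params = {
    TimePeriod: { Start: '2018-08-01', End: '2018-08-15' }, // illustrative dates
    Granularity: 'DAILY',
    Metrics: ['UnblendedCost'],
    GroupBy: [{ Type: 'DIMENSION', Key: 'USAGE_TYPE_GROUP' }],
  };
  const result = await ce.getCostAndUsage(params).promise();
  for (const day of result.ResultsByTime) {
    for (const group of day.Groups) {
      const cost = parseFloat(group.Metrics.UnblendedCost.Amount);
      if (cost > 10) { // only print the expensive usage type groups
        console.log(day.TimePeriod.Start, group.Keys[0], `$${cost.toFixed(2)}`);
      }
    }
  }
}

dailyCostsByUsageTypeGroup().catch(console.error);
```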
On some research, I found that “EC2: Data Transfer — Inter AZ” is the cost of data transfer between instances located in different Availability Zones. But all my instances in EC2 and ElastiCache were located in a single AZ. I verified that all the “Network Interfaces” in my EC2 console were located in the same AZ (it could have been that my load balancers were in a different AZ). Ideally, this cost shouldn’t have existed even if I was transferring petabytes of data. I had no idea what to do next.
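Checking the AZ of every network interface can also be done without clicking through the console. Here is a small sketch, assuming the aws-sdk v2 package for Node; the region is the one from my setup and pagination is omitted.

```js
// Sketch: list every network interface with its Availability Zone, to confirm
// nothing is sitting outside ap-southeast-1a. (Pagination omitted for brevity.)
const AWS = require('aws-sdk');
const ec2 = new AWS.EC2({ region: 'ap-southeast-1' });

async function listInterfaceAZs() {
  const { NetworkInterfaces } = await ec2.describeNetworkInterfaces().promise();
  for (const eni of NetworkInterfaces) {
    console.log(eni.NetworkInterfaceId, eni.AvailabilityZone, eni.Description || '');
  }
}

listInterfaceAZs().catch(console.error);
```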
I thought of moving some of my apps to Heroku, but the usage of ElastiCache kept me tied to AWS. Also, at the scale we are operating at, shifting instances to Heroku was not an option: I’ve seen Heroku buffer requests before, which increases response times significantly. I decided to contact AWS Support.
Step 3: Contacted AWS support to get more information about the costs
I subscribed to “Business Support” immediately as I was constrained by time. Business Support costs $100/month, but issues get resolved faster. As it was, I was already losing almost $100 every 2 days.
Suggestion 1: After some back and forth about “EC2: Data Transfer — Inter AZ”, they told me these costs also include data transfer between my instances in one AZ and some other customer’s instances in another AZ.
Suggestion 2: Use AWS VPC Flow Logs to monitor the IP traffic coming in and going out of all my network interfaces. Use IPTraf to monitor the traffic going in and out of a particular interface. Use iftop to monitor network flows on the system.
Suggestion 3: Block port 22 and port 80 from the world where it wasn’t required. I already had that configuration done in my AWS Security Groups. So, this wasn’t a problem.
Suggestion 4: Use private IPs to communicate between my services, as data transfer via public IPs is also charged under “EC2: Data Transfer — Inter AZ”. This wasn’t possible because my architecture is a load-balanced, auto-scaling, rolling-deployment setup where instances can be terminated and started at any time, so the private IPs keep changing. Unless I implemented my own load balancer, this suggestion wasn’t useful.
Suggestion 5: One of my instances was using up 1 TB of data transfer intermittently. They were confident that this instance was inflating the bills. Explored this by creating graphs for “Network In” and “Network Out” on CloudWatch. It was using up data for sure but the data spikes had no correlation with the consistent price increase. I did not believe this was the reason for high costs.
I decided to use Suggestion 2 to get more data on Suggestion 1.
Step 4: Use AWS VPC Flow Logs to monitor the traffic coming in and going out from all my network interfaces on AWS
I used VPC flow logs in conjunction with S3.
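For completeness, here is a hedged sketch of what enabling flow logs with an S3 destination looks like via the Node aws-sdk v2 (the console works just as well); the VPC id and bucket ARN below are placeholders, not my real resources.

```js
// Sketch: enable VPC Flow Logs for a VPC and deliver them straight to S3.
const AWS = require('aws-sdk');
const ec2 = new AWS.EC2({ region: 'ap-southeast-1' });

async function enableFlowLogs() {
  const params = {
    ResourceIds: ['vpc-0123456789abcdef0'],              // placeholder VPC id
    ResourceType: 'VPC',
    TrafficType: 'ALL',                                  // accepted and rejected traffic
    LogDestinationType: 's3',
    LogDestination: 'arn:aws:s3:::my-flow-logs-bucket',  // placeholder bucket ARN
  };
  const result = await ec2.createFlowLogs(params).promise();
  console.log('Flow log ids:', result.FlowLogIds);
}

enableFlowLogs().catch(console.error);
```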
Setting up Flow Logs with CloudWatch is a real pain in a time-constrained situation. It takes 15–20 minutes for the logs to start showing up. Below are a few sample rows from the log CSVs I downloaded, and what I found on going through them:
- The number of bytes being transferred was HUGE!! Close to 1 GB in some cases.
- I quickly eliminated the last 2 rows because both `srcaddr` and `dstaddr` start with `172.31.*.*`. I know 172.16.0.0–172.31.255.255 is a private IP range, which meant the data transfer was happening between instances within my AZ and so wouldn’t have contributed to the cost (a parsing sketch for this kind of filtering follows this list).
- For the remaining rows, `srcaddr` was a public IP and the port was `27017`. `27017` is the port usually used by the MongoDB process.
- I filtered the `dstaddr` for the remaining rows in AWS EC2 Instances and found it was my API microservice.
- I wanted more data to conclude that something was actually wrong while transferring data between API and MongoDB. I SSH’d into my API EC2 instances and installed IPTraf and iftop. On playing around with them for a bit, I could see that 13.230.24.140 was consuming data at an alarmingly high rate in real time.
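The filtering described above can be scripted instead of done by eye. The sketch below is an illustration, not my exact script: it assumes the downloaded records are in the default space-separated flow log format and have been concatenated into a plain text file (the path is a placeholder), and it sums bytes per flow while skipping flows that stay inside the 172.16.0.0/12 private range.

```js
// Sketch: aggregate bytes per (srcaddr -> dstaddr:dstport) from flow log records.
// Requires Node 12+ for the readline async iterator.
const fs = require('fs');
const readline = require('readline');

// True for addresses in 172.16.0.0–172.31.255.255 (the private range used in my VPC).
const isPrivate172 = (ip) => {
  const [a, b] = ip.split('.').map(Number);
  return a === 172 && b >= 16 && b <= 31;
};

async function summarize(path) {
  const totals = new Map();
  const rl = readline.createInterface({ input: fs.createReadStream(path) });
  for await (const line of rl) {
    const fields = line.trim().split(/\s+/);
    if (fields.length < 14 || fields[0] === 'version') continue; // skip header/partial rows
    // Default format: version account-id interface-id srcaddr dstaddr srcport
    //                 dstport protocol packets bytes start end action log-status
    const [, , , srcaddr, dstaddr, , dstport, , , bytes] = fields;
    if (bytes === '-') continue;                                  // NODATA/SKIPDATA rows
    if (isPrivate172(srcaddr) && isPrivate172(dstaddr)) continue; // stays inside the VPC
    const key = `${srcaddr} -> ${dstaddr}:${dstport}`;
    totals.set(key, (totals.get(key) || 0) + Number(bytes));
  }
  [...totals.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, 20)
    .forEach(([flow, total]) => console.log(flow, `${(total / 1e9).toFixed(2)} GB`));
}

summarize('./flow-logs.txt').catch(console.error); // placeholder path
```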
Step 5: Finding out which MongoDB cluster had high network transfer
I have multiple MongoDB clusters running on Atlas. I checked out the metrics of all my clusters on Atlas. One cluster stood out in “Network” usage:
An average network usage of 20 MB/s was HUGE! It meant the cluster was transferring more than 1 GB of data every minute. I decided to see if this had always been the case; our scale is good, but not that huge.
The network usage spiked suddenly on 7 Aug. This correlated with the increase in data transfer costs on AWS, but not 100%. There was also an orange line which correlated with the spike. I zoomed in on the graph to see if there were any more lines.
- There were 2 lines on Aug 7 around 12:30 PM.
- A red vertical bar indicates a server restart.
- An orange vertical bar indicates the server is now a primary
I remembered I had scaled my cluster up that day. I scaled it up because CPU usage had increased suddenly and there was CPU steal (CPU steal happens on a shared instance when the hypervisor has to intervene because a tenant is using more resources than allocated for a longer period of time). As visible, CPU steal started around 3 Aug, but the process CPU had been high since 30 Jul.
The above gave me a 100% correlation with the billing increase:
- Something led to the CPU maxing out from 30 July. I believe it was the network usage itself, although it does not show up in the “Network” graph. The cluster instance was a shared one, so the per-second network usage looked low even though the overall transfer was high.
- This led to CPU Steal from 3 Aug.
- On noticing this on 8 Aug, I scaled the cluster, which led to an immediate increase in network usage because the cluster now had more bandwidth to play with. CPU usage decreased as a result, but network usage spiked.
Now that I had a 100% correlation with the billing increase from the MongoDB end, I wanted a correlation from the EC2 instances. The VPC flow logs contained a lot of data, and I wanted to be completely sure that the API service was the only issue.
Step 6: Finding out which service on AWS had high network transfer
As all my services run on Elastic Beanstalk on AWS, I added metrics for “NetworkIn”, “NetworkOut” and “AvgEstimatedProcessedBytes” on all of them. Eliminating services based on the metrics, I found that only the API service was the issue. It had a very high “AvgEstimatedProcessedBytes” and a very high throughput.
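The per-environment comparison boils down to querying CloudWatch. A minimal sketch with the Node aws-sdk v2 is below; the instance id is a placeholder, and in practice this would be run over the instances of each Beanstalk environment rather than a single hard-coded one.

```js
// Sketch: pull hourly NetworkOut totals for one EC2 instance over the last day,
// so environments can be compared by how much data they push out.
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'ap-southeast-1' });

async function hourlyNetworkOut(instanceId) {
  const params = {
    Namespace: 'AWS/EC2',
    MetricName: 'NetworkOut',
    Dimensions: [{ Name: 'InstanceId', Value: instanceId }],
    StartTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
    EndTime: new Date(),
    Period: 3600,          // one datapoint per hour
    Statistics: ['Sum'],   // total bytes sent out in each hour
  };
  const { Datapoints } = await cloudwatch.getMetricStatistics(params).promise();
  Datapoints
    .sort((a, b) => a.Timestamp - b.Timestamp)
    .forEach((d) => console.log(d.Timestamp.toISOString(), `${(d.Sum / 1e9).toFixed(2)} GB`));
}

hourlyNetworkOut('i-0123456789abcdef0').catch(console.error); // placeholder instance id
```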
I had my 100% correlation with the billing increase from both MongoDB and AWS. Now, I had to find out why so much network transfer was happening between MongoDB cluster and API service.
Step 7: Finding out the highest throughput transaction on API
I knew that the highest-throughput transaction isn’t necessarily the one with the highest network usage, but it was a pretty good place to start. I used New Relic to get this data; it had been installed on my API service since the beginning.
During the time of debugging, `/resource-1` had a throughput of 2k requests per minute (rpm). To fix this, I used stdout logs to trace the code paths from which `/resource-1` was being accessed, and brought the calls down to a minimum. My hunch was confirmed: even though the throughput on API dropped drastically, the network usage didn’t reduce on either API or MongoDB.
I had a hunch that `/resource-3` was the culprit because I knew the corresponding collection for `/resource-3` stores a lot of encrypted data in each document. To confirm it, I used the morgan npm package to log the metadata of each HTTP request hitting API and its corresponding response. I found that requests hitting `/resource-3` with one particular set of URL query params were transferring close to 7 MB of data in each response. I just had to find that specific `/resource-3` query in my codebase and fix it. The reason for such high data transfer was that the code hitting `/resource-3` was optimized for one particular type of scale but broke completely when another type of scaling problem hit us.
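For illustration, here is roughly what that kind of morgan setup looks like on an Express app; the route handler, port and 1 MB threshold are placeholders. The point is that the `:res[content-length]` token puts the response size next to the full URL (query params included) in stdout, which is what makes 7 MB responses stand out.

```js
// Sketch: log every request with its response size and latency, and additionally
// flag any response larger than ~1 MB.
const express = require('express');
const morgan = require('morgan');

const app = express();

// Method, full URL (including query params), status, response size, latency.
app.use(morgan(':method :url :status :res[content-length] bytes - :response-time ms'));

// Second logger that only fires for large responses.
app.use(
  morgan('LARGE RESPONSE :method :url :res[content-length] bytes', {
    skip: (req, res) => Number(res.getHeader('content-length') || 0) < 1e6,
  })
);

app.get('/resource-3', (req, res) => {
  // ...placeholder handler that queries MongoDB and returns documents...
  res.json({ ok: true });
});

app.listen(3000);
```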
On fixing this, there was an immediate drop in the network usage both on API and on MongoDB cluster:
Our “EC2: Data Transfer — Inter AZ” cost now:
Learnings
- Always think about scale when writing code. Think about what might happen when you are processing a million transactions each second. Prepare your code to handle the worst cases. The best cases would be handled automatically.
- Have different tools at your disposal which can help you out when the need arises. Every data point in situations like these is helpful. I don’t use New Relic during my everyday tasks, but I still had it monitoring my API service continuously for days like these. If I didn’t have it, it would have taken much more time to get to the root cause of the issue. For example, I didn’t have VPC flow logs running beforehand, and that ate up some time during debugging.
- Have proper alerts set up to notify when things start going wrong. I noticed the CPU Steal issue with my MongoDB cluster a week after it started. I noticed the issue on increasing costs on AWS almost 15 days after it started. This is unacceptable in a production environment and a startup which is gaining scale every day.
Tools used, in chronological order: