Aws S3 Bucket Explorer

S3 Cost Optimization

Amazon Simple Storage Service (Amazon S3) is one of the most popular Amazon Web Services (AWS) offerings, with flexible pricing: you pay only for the storage space you use. Many bloggers, including Werner Vogels, CTO of AWS, host their blogs for less than a couple of dollars a month. At the other end of the spectrum, companies such as Sumo Logic use S3 to store petabytes of data. From our experience of using S3 and other AWS services, we are convinced that for most enterprises S3 is one of the top five biggest spends among all AWS offerings. In this article we will discuss different approaches for reducing Amazon S3 costs and improving your margin.

AWS S3 Pricing

There are three major costs associated with S3:

  1. Storage cost: charged per GB / month. ~ $0.03 / GB / month, charged hourly
  2. API cost for operation of files: ~$0.005 / 10000 read requests, write requests are 10 times more expensive
  3. Data transfer outside of AWS region: ~$0.02 / GB to different AWS region, ~$0.06 / GB to the internet.

Actual prices differ a bit based on volume and region, but the optimization techniques stay the same. I will use the above prices in the following cost estimates.
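The three cost components can be combined into a rough monthly estimate. The sketch below is only an illustration: the function name and its parameters are hypothetical, and the constants are the approximate prices quoted above, not current AWS pricing.

```python
# Approximate prices quoted in the article (assumptions, not current AWS pricing).
STORAGE_PER_GB_MONTH = 0.03
READ_PER_REQUEST = 0.005 / 10_000
WRITE_PER_REQUEST = 10 * READ_PER_REQUEST  # writes are ~10x more expensive
INTER_REGION_PER_GB = 0.02
INTERNET_PER_GB = 0.06


def estimate_monthly_cost(stored_gb, reads=0, writes=0,
                          inter_region_gb=0.0, internet_gb=0.0):
    """Return a rough monthly S3 cost in USD, split by component."""
    storage = stored_gb * STORAGE_PER_GB_MONTH
    api = reads * READ_PER_REQUEST + writes * WRITE_PER_REQUEST
    transfer = (inter_region_gb * INTER_REGION_PER_GB
                + internet_gb * INTERNET_PER_GB)
    return {"storage": storage, "api": api, "transfer": transfer,
            "total": storage + api + transfer}
```

Splitting the result by component makes it easy to see which of the three areas dominates for a given workload.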

S3 Costs Basics

One of the most important aspects of the Amazon S3 pricing structure is that you pay only for the storage you use, not for storage you provision. For example, a 1 GB file stored on S3 is billed as 1 GB, no matter how much capacity you planned for. In many other services, such as Amazon EC2, Amazon Elastic Block Store (Amazon EBS) and Amazon DynamoDB, you pay for provisioned capacity. For example, with an Amazon EBS disk you pay for the full 1 TB of disk even if you save just a 1 GB file. This makes managing S3 cost easier than for many other services, including Amazon EBS and Amazon EC2. On S3 there is no risk of over-provisioning and no need to manage disk utilization.

Given this, most S3 users don't need to worry about cost optimization right away. Engineering time is not free. The best bet is to start simple and worry about the monthly S3 bill after it has crossed a certain threshold. However, there are a few basics that are worth getting right, as they may be costly to fix down the line:

Pick the right AWS region for your S3 bucket.

Ensure EC2 and S3 are in the same AWS region. The main benefits of having S3 and EC2 in the same region are better performance and lower transfer cost. Data transfer is free between EC2 and S3 in the same region. Downloading files from another AWS region will cost $0.02/GB.

  • For example, Sumo Logic processes data within the same region and mostly eliminates the S3 to EC2 inter-region data transfer cost. If the S3 bucket were in a different region, then assuming each file is downloaded on average 3 times per month (3 * 0.02 = $0.06 / GB on top of $0.03 / GB for storage), our S3 costs would triple.

Pick the right naming scheme (AWS guide).

Though this doesn't directly impact the S3 cost, it may make S3 so much slower that you need to use an additional caching layer that sometimes can be avoided.

Don't share Amazon S3 credentials and monitor credential usage.

A lot of developers bake IAM access keys and/or secret keys inside their application. While this may be required for users to directly perform operations on S3 and may simplify your architecture, it also means that any user can potentially cause a lot of additional costs. This may be malicious or just a simple accident. At minimum:

  • Use temporary credentials that can be revoked. Give them need-to-know access (minimum rights) to complete the task.
  • Monitor access keys and credential usage on a regular basis to avoid any surprises.
  • A good example: on any S3 bucket where a third party can upload objects, you should set up a CloudWatch alarm on 'BucketSizeBytes'. This alerts you when malicious users upload terabytes of data to your S3 bucket, so you can react before the bill grows.
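An alarm on 'BucketSizeBytes' could be defined roughly as follows. This is a minimal sketch: the helper name, alarm name, threshold and SNS topic are hypothetical, and the dictionary only builds the arguments for CloudWatch's PutMetricAlarm operation, so it runs without AWS access.

```python
def bucket_size_alarm_params(bucket_name, threshold_bytes, sns_topic_arn):
    """Build keyword arguments for CloudWatch's PutMetricAlarm operation."""
    return {
        "AlarmName": f"{bucket_name}-size-over-threshold",  # hypothetical name
        "Namespace": "AWS/S3",
        "MetricName": "BucketSizeBytes",
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket_name},
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
        "Statistic": "Average",
        "Period": 86400,  # S3 storage metrics are reported once a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   params = bucket_size_alarm_params("uploads", 10**12, "arn:aws:sns:...")
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Because the metric is reported daily, such an alarm detects growth with up to a day of delay; it is a safety net rather than a hard limit.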

Never start with Amazon Glacier right away.

If you don't understand Glacier or your application requirements well, or, in the worst case, your application requirements change, you may end up paying a lot later on (there are many horror stories). Keep it simple. Don't start with the infrequent access storage class unless you don't plan to read these objects.

Analyze Your S3 Pricing Bill

The best way to start your cost optimization effort is to review your AWS bill:

  • On your AWS console, review the aggregated AWS S3 spend (link to AWS Console).
  • To get a more granular per-bucket view of your S3 costs, enable Cost Explorer or enable reporting to an S3 bucket.
    • Cost Explorer is the easiest to start with.
    • Downloading data from 'S3 reports' to a spreadsheet gives you more flexibility.
    • Once you reach a certain scale (e.g. the Sumo Logic bill is over 1 GB / month), using a dedicated cost monitoring SaaS such as CloudHealth is the best bet.
    • Keep in mind that the AWS bill is updated every 24 hours for storage charges, even though S3 storage is charged by the hour.
  • Getting per-object data can be handy, but beware of the cost if you require it on a regular basis.
    • You can enable the S3 access log, which gives you an entry for each API access. Keep in mind that this access log can grow very quickly and cost a lot to store.
    • You can list all objects using the API. Either write your own script or use a third-party GUI such as S3 Browser.

For example, 85%+ of AWS S3 costs for Sumo Logic are related to storage. The second group is API calls, at around 10% of the S3 cost. However, there are some S3 buckets where API calls are responsible for 50% of costs. We used to pay for data transfers, but right now this cost is negligible.

S3 Cost Optimizations

It usually makes sense to focus on the areas where you spend the most: storage, API or data transfer. Some cost optimizations improve your overall efficiency, while others automate waste reduction.

Save money on AWS S3 storage fees

Don't store files that you don't need! Here are some ideas to consider for reducing your storage costs.

  • Delete files after a certain date that are no longer relevant.

    A lot of deployments use S3 for log collection, but later send the logs to Sumo Logic. You can automate deletion using S3 lifecycle rules, e.g. delete objects 7 days after their creation time. If you use S3 for backups, it may make sense to delete them after a year.

  • Delete unused files which can be recreated.

    One example is the same image stored in many resolutions for thumbnails or galleries that are accessed rarely. It may make sense to keep just the original image and recreate the other resolutions on the fly. Another example is Sumo Logic binaries. We can rebuild our binaries from git. We need them to deploy a new version of the software, but there is little sense in keeping binaries older than one year. We use lifecycle rules to delete them.

  • When using an S3 versioned bucket, use the 'lifecycle' feature to delete old versions.

    By default, a delete or overwrite in an S3 versioned bucket keeps all data forever, and you will pay for it forever. In most use cases you want to keep older versions only for a certain time. You can set up a lifecycle rule for that.

  • Clean up incomplete multipart uploads.

    Especially if you upload a lot of large S3 objects, any interrupted upload may result in partial objects that are not visible, but that you still pay to store. It almost always makes sense to clean up incomplete uploads after 7 days. If you have a petabyte S3 bucket, then even 1% of incomplete uploads may end up wasting terabytes of space.
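The cleanup ideas above map onto a single S3 lifecycle rule. The sketch below builds such a configuration as a plain dictionary; the helper name, rule ID, prefix and the 30-day retention for old versions are assumptions chosen for illustration, while the 7-day values follow the text.

```python
def build_lifecycle_rules(prefix="logs/", expire_days=7,
                          noncurrent_days=30, abort_multipart_days=7):
    """Build a lifecycle configuration for objects under `prefix`."""
    return {
        "Rules": [
            {
                "ID": "cleanup-" + prefix.strip("/"),  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Delete current objects N days after creation.
                "Expiration": {"Days": expire_days},
                # In versioned buckets, delete old versions after N days.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
                # Remove the parts of uploads that never completed.
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_multipart_days
                },
            }
        ]
    }

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=build_lifecycle_rules())
```

Scoping the rule to a prefix keeps the expiration from touching objects elsewhere in the bucket.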

Lower AWS Data Transfer Costs By Compressing Data

It almost always pays to use a fast compression algorithm such as LZ4, which improves performance and at the same time reduces your storage requirement and hence the cost. In many use cases, it makes sense to use more compute-intensive compression such as GZIP or ZSTD.

You usually trade CPU time for better network I/O and less spend on S3. For example, most Sumo Logic objects are compressed with GZIP, but we are investigating better compression. Most likely we will migrate to ZSTD, which gives us better performance while using less space.
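The CPU versus size trade-off can be measured on your own data before choosing a codec. The sketch below uses only the Python standard library, so zlib at level 1 stands in for a fast codec and gzip at level 9 for a compute-intensive one; LZ4 and ZSTD themselves need third-party packages and are not shown. The sample payload is made up.

```python
import gzip
import zlib


def compression_report(payload: bytes) -> dict:
    """Compare payload size under a fast and a thorough compression setting."""
    fast = zlib.compress(payload, 1)                  # cheap on CPU
    small = gzip.compress(payload, compresslevel=9)   # more CPU, usually smaller
    return {"original": len(payload), "fast": len(fast), "small": len(small)}


# Hypothetical log-like sample; real ratios depend entirely on your data.
sample = b'{"level": "INFO", "service": "api", "message": "request served"}\n' * 10_000
report = compression_report(sample)
```

Multiplying the size reduction by your storage and transfer prices gives a quick estimate of what the extra CPU time buys.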

In Big Data Applications, Your Data Format Matters

Using better data structures can have an enormous impact on your application performance and storage size. The biggest changes:

  • Use a binary format (e.g. AVRO) instead of a human-readable format (e.g. JSON). Especially if you store a lot of numbers, a binary format such as AVRO can store large numbers in less space than JSON. For instance, '1073741007' takes 10 bytes in JSON versus 4 bytes as a fixed-width 32-bit integer (AVRO's variable-length integer encoding needs at most 5 bytes for such a value).
  • Choose between row-based and column-based storage. The general rule of thumb is to use columnar storage for analytics batch processing, which can provide better compression and storage optimization. However, this topic deserves its own article.
  • Decide what you should index, what metadata you should store and what you should calculate on the fly. A Bloom filter may remove the need to access some files at all. Some indexes may waste storage for little performance gain, especially if you have to download the whole file from S3 anyway.
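The size difference between text and binary encodings of a number is easy to check. The sketch below compares JSON text, a fixed-width 32-bit integer, and a zig-zag variable-length integer; the varint function is a hand-written illustration of the style of encoding AVRO uses for integers, to the best of my understanding, and is not the AVRO library itself.

```python
import json
import struct


def zigzag_varint(n: int) -> bytes:
    """Encode a signed integer (64-bit range) as a zig-zag base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # map signed values to unsigned: 0, -1, 1, -2, ...
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


value = 1073741007
as_json = json.dumps(value).encode("utf-8")   # decimal digits as text
as_fixed32 = struct.pack("<i", value)         # fixed-width 32-bit integer
as_varint = zigzag_varint(value)              # variable-length integer
sizes = {"json": len(as_json), "fixed32": len(as_fixed32), "varint": len(as_varint)}
```

Variable-length encodings are most attractive when most values are small, since small numbers then take one or two bytes.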

Use Infrequent Access Data Storage Class in Amazon S3

The infrequent access (IA) storage class provides the same API and performance as regular S3 storage. IA is approximately four times cheaper than S3 standard storage ($0.007 / GB / month vs. $0.03 / GB / month), but the catch is that you pay for retrieval ($0.01 / GB). Retrieval is free in the standard S3 storage class.

If you download objects less than two times a month, you save money using IA. Let's consider the following three scenarios where IA can considerably reduce the cost.

  • Scenario 1 : Using IA for disaster recovery

At Sumo Logic we use IA for backups. These backup files are used for disaster recovery. It makes sense for us to directly upload any object over 128KB to IA and save 60% on storage for a year without losing availability or durability of the data.

  • Scenario 2: Using automation to move unwanted files to IA

At Sumo Logic we use S3 to distribute binaries. We usually use them just up to a month after initial upload to S3. However, in the rare case, we still need the capability to quickly rollback to an older version. We use S3 lifecycle to automatically move binaries after 30 days to IA. This approach reduces the cost without compromising data availability.

  • Scenario 3: Use IA for infrequently accessed data

Sumo Logic's system is designed in such a way that in a few places we use S3 as the final source of truth for reliability reasons, but we only access it when an EC2 machine goes down or during data migration, which is infrequent. Given that there is a class of S3 objects that is downloaded on average 0.2 times a month, it makes sense to just keep them in IA. For every 1 GB, we save $0.021 / month (S3 standard storage cost - IA storage cost - IA access cost = 0.03 - 0.007 - 0.2 * 0.01). Multiply that by a petabyte, and that's just the monthly savings.
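The saving formula used in this scenario can be written down and reused for other access patterns. This is a small sketch with hypothetical function names, using the approximate per-GB prices quoted in this article.

```python
# Approximate per-GB prices quoted in the article (assumptions).
STANDARD_STORAGE = 0.03   # USD / GB / month
IA_STORAGE = 0.007        # USD / GB / month
IA_RETRIEVAL = 0.01       # USD / GB retrieved


def ia_monthly_saving_per_gb(downloads_per_month: float) -> float:
    """Saving per GB per month when using IA instead of standard storage."""
    return STANDARD_STORAGE - IA_STORAGE - downloads_per_month * IA_RETRIEVAL


def ia_break_even_downloads() -> float:
    """Downloads per month at which IA and standard storage cost the same."""
    return (STANDARD_STORAGE - IA_STORAGE) / IA_RETRIEVAL


saving = ia_monthly_saving_per_gb(0.2)   # objects read about once in five months
break_even = ia_break_even_downloads()
```

A negative result from `ia_monthly_saving_per_gb` means the objects are read too often for IA to pay off at these prices.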

IA is great, but when is it not?

IA has restrictions such as a minimum billable object size and a minimum storage retention period. IA charges for at least 128 KB per object and for a minimum of 30 days of storage. In addition, migrating an object to or from 'S3 standard' costs one API call.

However, IA is significantly easier to use than Glacier. Recovery from Glacier can take a very long time, and any increase in speed increases your cost. If you store 1 TB of data on Glacier, you can extract that data for free at a rate of 1.7 GB / day. Recovering 1 TB in an hour requires a 998 GB / h peak recovery rate. This will cost 0.01 * 998 * 24 * 30 = $7186! If you decide to recover 1 TB in 2 hours, you will pay $3592.
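The retrieval arithmetic above can be reproduced with a one-line formula. The sketch below only mirrors the calculation in this article (price times peak hourly rate times the hours in a 30-day month); the function name is hypothetical, and neither the billing model nor the prices are checked against current Glacier pricing.

```python
def glacier_retrieval_fee(peak_gb_per_hour: float,
                          price_per_gb: float = 0.01,
                          hours_in_month: int = 24 * 30) -> float:
    """Fee as computed in the article: the peak hourly rate billed for a month."""
    return price_per_gb * peak_gb_per_hour * hours_in_month


one_hour = glacier_retrieval_fee(998)   # billable peak rate quoted in the text
two_hours = glacier_retrieval_fee(499)  # assumes the peak rate simply halves
```

Under this model the fee scales with the peak rate, so spreading a recovery over more hours lowers the cost proportionally.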

How to Save on S3 API Costs

Here are some tips on how you can reduce the costs of your API access.

API calls cost the same irrespective of the data size

API calls are charged per object, regardless of its size. Uploading 1 byte costs the same as uploading 1 GB. As a result, small objects can cause API costs to soar.

PUT calls cost $0.005 / 1000 calls.

For instance, the API cost is negligible if you upload 10 GB as a single file. If the file is divided into 5 MB chunks, it costs ~$0.01. However, with 10 KB chunks it will cost you ~$5.00. The cost grows in inverse proportion to the chunk size: the smaller the files, the more calls you pay for.
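The chunk-size example can be checked with a few lines of arithmetic. This is a minimal sketch with a hypothetical helper, using the PUT price quoted above and binary units (1 GB = 1024 MB).

```python
PUT_PRICE_PER_1000 = 0.005  # USD, as quoted above

KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def put_cost(total_bytes: int, chunk_bytes: int) -> float:
    """Cost of uploading total_bytes as objects of at most chunk_bytes each."""
    calls = -(-total_bytes // chunk_bytes)  # ceiling division
    return calls * PUT_PRICE_PER_1000 / 1000


single_file = put_cost(10 * GB, 10 * GB)   # one PUT call
medium_chunks = put_cost(10 * GB, 5 * MB)  # 2,048 PUT calls
tiny_chunks = put_cost(10 * GB, 10 * KB)   # 1,048,576 PUT calls
```

Halving the chunk size doubles the number of calls, and therefore the API cost, for the same amount of data.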

Batch objects whenever it makes sense to do so

A lot of tiny objects can get very expensive very quickly, so it makes sense to batch them. If you always upload and download all objects at the same time, it is a no-brainer to store them as a single file (using tar). At Sumo Logic we usually combine this with compression.

You should design your system to avoid a huge number of small files. It is usually a good pattern to have some clustering that prevents small files.

For example, instead of creating a new file for every piece of data, you can group the data in the same file until 15 seconds have elapsed or the file size reaches 10 MB. In other words, start a new file every 15 seconds or every 10 MB, whichever you hit first.
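The "15 seconds or 10 MB, whichever comes first" rule could be sketched as a small buffer in front of the uploader. The class below is an illustration under stated assumptions: `Batcher` and its `upload` callback are hypothetical names, the age check only runs when a record is added, and a real system would also flush from a timer and on shutdown.

```python
import time


class Batcher:
    """Group small records and hand them to `upload` as one larger object."""

    def __init__(self, upload, max_bytes=10 * 1024 * 1024, max_age_s=15.0,
                 clock=time.monotonic):
        self._upload = upload      # callable taking one bytes object
        self._max_bytes = max_bytes
        self._max_age_s = max_age_s
        self._clock = clock        # injectable so the logic can be tested
        self._buffer = []
        self._size = 0
        self._started = 0.0

    def add(self, record: bytes) -> None:
        now = self._clock()
        if not self._buffer:
            self._started = now    # age is measured from the first record
        self._buffer.append(record)
        self._size += len(record)
        too_big = self._size >= self._max_bytes
        too_old = now - self._started >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._upload(b"".join(self._buffer))
            self._buffer = []
            self._size = 0
```

In practice `upload` would wrap a single S3 PUT, so each flush costs one API call regardless of how many records it contains.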

If you have tiny files, it usually makes sense to use a database such as DynamoDB or MySQL instead of S3. You can also use a database to group objects and upload them to S3 later. 10 writes per second in DynamoDB cost $0.0065 / hour, or $(0.0065/3600) / sec. Assuming 80% utilization, DynamoDB costs $0.000226 / 1000 calls ([{0.0065/3600}*1000] / (10 * 0.8)) vs. S3 PUT at $0.005 / 1000 calls. That makes DynamoDB 95% cheaper than S3 in this use case.
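The DynamoDB comparison above can be reproduced step by step. The sketch uses the prices and the 80% utilization figure from the text; the function name is hypothetical.

```python
S3_PUT_PER_1000 = 0.005  # USD per 1000 PUT calls, as quoted above


def dynamodb_cost_per_1000_writes(hourly_price: float = 0.0065,
                                  writes_per_second: float = 10,
                                  utilization: float = 0.8) -> float:
    """Effective cost of 1000 writes on provisioned write capacity."""
    price_per_second = hourly_price / 3600
    effective_writes_per_second = writes_per_second * utilization
    return price_per_second * 1000 / effective_writes_per_second


ddb_per_1000 = dynamodb_cost_per_1000_writes()
relative_saving = 1 - ddb_per_1000 / S3_PUT_PER_1000
```

Lower utilization raises the effective DynamoDB price per write, so the comparison is worth rerunning with your own utilization figure.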

The S3 file names are not a database. Relying too much on S3 LIST calls is not the right design and using a proper database can typically be 10-20 times cheaper.

How to Save on AWS Data Transfer Costs

If you do a lot of cross-region S3 transfers, it may be cheaper to replicate your S3 bucket to a different region than to download each object between regions every time.

Suppose 1 GB of data in us-west-2 is expected to be transferred 20 times to EC2 in us-east-1. If you use inter-region transfer each time, you will pay $0.40 for data transfer (20 * 0.02). However, if you first copy it to a mirror S3 bucket in us-east-1, you pay just $0.02 for transfer and $0.03 for storage over a month, which is about 88% cheaper. S3 has this built in as a feature called cross-region replication. You also get better performance along with the cost benefits.
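Whether replication pays off depends on how often the data crosses regions. The sketch below compares both options using the approximate prices from this article ($0.02 / GB inter-region transfer, $0.03 / GB / month storage); the function names are hypothetical and request costs are ignored.

```python
INTER_REGION_PER_GB = 0.02   # USD, approximate price used in this article
STORAGE_PER_GB_MONTH = 0.03  # USD, approximate price used in this article


def direct_transfer_cost(size_gb: float, downloads: int) -> float:
    """Every download crosses regions and pays the transfer fee."""
    return size_gb * downloads * INTER_REGION_PER_GB


def replicated_cost(size_gb: float, months: float = 1.0) -> float:
    """Copy once to the other region, then pay for the extra stored copy."""
    return size_gb * INTER_REGION_PER_GB + size_gb * STORAGE_PER_GB_MONTH * months


direct = direct_transfer_cost(1, 20)
mirrored = replicated_cost(1)
relative_saving = 1 - mirrored / direct
```

At these prices the mirror bucket becomes cheaper once an object is downloaded across regions three or more times a month.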

If a lot of files stored in S3 are downloaded directly by end users (e.g. images on consumer sites), consider using Amazon CloudFront, the AWS content delivery network (CDN). Depending on your traffic, CloudFront can be cheaper or more expensive than serving from S3 directly, but you gain a lot of performance.

There are CDN providers such as Cloudflare that charge a flat fee. If you have a lot of static assets, a CDN can give huge savings over S3, as just a tiny percentage of the original requests will hit your S3 bucket.

You may use S3 to save on data transfer between EC2 instances in different availability zones (AZ). Data transfer between two EC2 instances in different AZs costs $0.02/GB, but S3 is free to download from any AZ.

Consider a scenario where 1 GB of data is transferred 20 times from one EC2 server to another in a different availability zone. It will cost $0.40 (20 * 0.02). However, if you upload it to S3, you pay only for storage ($0.03 / GB / month), and data transfer between S3 and EC2 is free. S3 storage is charged per GB per hour. Assuming the data is deleted from S3 after a day, the S3 cost will be $0.001. That is a 99% cost saving on that data transfer by using S3.

Final Thoughts: AWS S3 Pricing and Costs

There are a lot of opportunities for S3-specific optimizations. Project management can be an absolute data-driven meritocracy: you can estimate the savings and the effort required to realize them. However, understanding what's going on and managing the complexity can be challenging. In Sumo Logic's journey we were able to achieve 70%+ savings over the initial implementation. Optimizing costs can be as much fun as playing computer games. You tweak a few things and your AWS bill goes down.
