DEEP DIVE ON DYNAMO DB
SQL vs NoSQL
- SQL is used for optimization of storage because it uses normalized data structures that deduplicate the data on the disk and also it supports ad-hoc queries to generate materialised views that the application needs. This helps the application not only to select individual values but needs to select data from a table and join another table. For example, you may have a customers table and an orders table and you may want all the orders for the customers so it would be “select from customers inner join orders” whereas in NoSQL you would place customers and orders in a single table/document, it would be faster. SQL scales vertically, you would need to buy a new stack. However NoSQL can scale horizontally/incrementally.
- SQL is best in OLAP space / online analytics
- OLTP apps on the other hand, they always process the data exact same way when the user clicks the button, same things happen so you can store data in a normalised way. If you really want performance and scale in OLTP (Online Transaction Processing - transaction oriented applications, typically data entry and retrieval transaction processing) you would need NoSQL.
- Normally in SQL, let’s say you have products table that would tie to a bunch of others through common relationships, one-to-one, one-to-many and many-to-many relationships. What kind of queries you would run to get a list of all products? 3 different queries, 1 is easy but the other 2 would be pretty complex and think about how much time that CPU is spending and hopping around the disk to build this materialized view. In NoSQL, you don’t want normalized data, you want aggregations! You would want hierarchies as stored items and denormalised view.
- It is a fully managed NoSQL
- Document or key-value database
- Scales to any workload
- Fast and consistent
- Access control - IAM users and permissions on a table and item level
- Able to adapt event driven programming - where you would use in Serverless
Dynamo DB Core components
- Table: Tables have items and items have attributes. Mandatory attribute is the partition key which will uniquely identify the item. Optional is the sort key. The combination will identify the uniqueness of the data. This is almost like a bucket. In the case of customers and orders, you might imagine the partition key could be the customer ID the sort key could be the Order ID, now all of the orders for a given customer will fall in to the same partition key. You wanna line all those items on the disk and would be able to do a sequential read of this disk. Sort key gives the ability to execute complex large queries against those attributes.
Scalability of Dynamo DB
- Partition keys are typically hashed to create a randomized range or key space. DynamoDB takes those hash values and line those items up across that key space and as you increase the capacity or the size of the table DynamoDB will chunk that key space and split it across multiple nodes, this is a fundamental principle in NoSQL. 2 dimensions
- Throughput: How much you need? Read/Write capacity to split into partitions. This is provisioned at the table level. Write capacity and Read capacity independently. Write is measured in 1 KB per second and Read in 4 KB per second. Read Capacity Units measure strictly consistent reads. Eventually consistent reads cost 1/2 of consistent reads.
- Size: Max item size is 400KB
- Number of partitions is calculated by looking at Capacity (Total RCU / 3000) + (Total WCU / 1000) and by size (Total size / 10GB). DynamoDB will give you the partitions as CEILING(MAX (Capacity, Size)). This is given to evenly distribute the RCUs and WCUs across these partitions
- If sustained throughput goes beyond provisioned throughput per partition. You might get a burst bucket provided to you for a short period of time. Throttling means you would have a hot key on a particular partition that goes for too long you exhaust that and you are gonna be throttled. Causes
- Hot keys/partitions
- Very large items
- Time series data, getting table bigger and bigger so as the partition and eventually throughput is lower
How to get most out of DynamoDB
- Evenly distributed data and partitions and time within your DB
- Data Modelling in NoSQL: It’s all about aggregations. Therefore, it’s important to model in that way
- Use cases: Social Network, Document Management, Process Control, IT monitoring and Data trees
- Hierarchical Data - How would you represent? Create a series of items and store these in partitions. As JSON. It is useful in order to save cost in WCU whenever you insert new items
Design Patterns and Best Practices
1) DynamoDB Streams and Lambda
- Lambda as the best Stored Procedures (A stored procedure is a group of one or more database statements stored in the database’s data dictionary and called from either a remote program, another stored procedure, or the command line.)
2) Real-time voting
- There is a long list or short list of candidates. One or two options that get a lot of votes. If I am trying to aggregate votes on a table and aggregating votes by candidate ID and what happens is two keys very busy, that’s the definition of a hot-key. The way to deal with these is to take those candidate IDs and add some arbitrary random value within a known range to the end of those like CandidateA_4, CandidateA_5 etc. This way I am spraying the votes across partitions. Later on I execute a scatter gather to merge the results. Shard write-heavy partition keys!!! Trade-off cost for write scalability and consider throughput per partition key.
3) Event Logging
- Time series data: Two types Static Time Series Data: This would be an event monitoring system, ITSM type of app, event streams coming in off from different devices. This is basically dealing with Operational Analytics data over a period of time. Therefore, after that period of time, that data is not so relevant anymore, less queries will be operated. You will create a hot table for that purpose that has high write capacity high read capacity taking all the ingestion. Then you will turn it to a cold or warm table where there will be zero write capacity. Data Lifecycle.
- Dynamic Time Series Data: Could be session data or user data. Every time the user touches the data, you need to upgrade the time stamp on that data. The data on the table pretty much spreads out all over the place, some data is hot some cold. You have to build a TTL index by creating an attribute on every item whenever you insert the event or the object on the table every item gets an attribute that’s a GSI key and it’s a random range between 0 to n. What you are doing is the right sharding the GSI. What you do is to create a GSI on that GSI key attribute and you arrange it on the timestamp of all your items. What you achieved is basically rotated the entire table into a single index on one attribute and you can execute a range query via scatter gather against that TTL index and get a very selective result set which won’t burn a lot of capacity because you don’t need to scan every item. You can basically quickly retrieve the expired items by AWS Lambda running as a scheduled process.
4) Product Catalog
- In retail space, or sales activities you got people looking at selective items from your catalog. You would implement a cache where you can put those high-velocity items in that cache and you’ll read from the cache first.
5) Multi-version Concurrency (Transactional NoSQL)
OLTP apps use the data always in a hierarchical way. They use also top-level entities, so called as entity-driven workflows, a customer, a product. These are highly relational. Use item partitions to manage transactional workflows
- Manage versioning across items with metadata
- Tag attributes to maintain multiple versions
- Code your app to recognize when updates are in progress
- Implement app layer error handling and recovery logic
6) Messaging APP
- You got a messaging app, let’s think about email. Could be a video game or something where I have a messaging app infra but I have got some messages table. And I want views of that table, inbox and outbox. You can create a table “messages” that is hashed on recipient or partition on recipient and sorted on date. If the average message size is 256 KB, and when you go and select 50 items, and let’s assume you apply eventually consistent read which is half the cost and then I will still eat up 1600 RCU (50 * 256KB * (1RCU/4KB) * (1/2)). View of the inbox, you are not interested in the view but only the metadata includes subject etc. Take a relational book and vertically partition this table as an old strategy, then you will create another GSI with only metadata
7) Multiplayer Online Gaming
- Table with user invites to games. Bunch of games done, progress and pending. What you want is to retrieve the users in pending within a particular timeframe. You would apply range queries against sort key attributes!!! You would apply a filter on condition additionally that only knocks out items after they have been read
- Sparse indexes should be used in DynamoDB to minimize database size (and cost). Look up “local secondary instances” in the DynamoDB guide. RDS is not suitable for this, it would be almost impossible to archive and the database would just keep growing. With archiving DynamoDB would be a much more cost effective solution. DynamoDB allows for higher write throughput at lower cost than traditional relational databases.
ElastiCache Redis vs Memcached
- complex data objects vs simple key value storage
- persistent vs non persistent, pure caching
- automatic failover with Multi-AZ vs Multi-AZ not supported
- scaling using Read Replicas vs using multiple nodes (not good for backups/restore)
- backup & restore supported vs not supported
- Memcached supports multi-threaded operations
- ElastiCache supports backup and restore
DynamoDB and SQS - R/W throughput optimisation
- How can I reduce the read/write throughput especially in a spiky workloads?
- Use SQS as a fronting agent to communicate with DynamoDB. If you are not worried about the fastest latency to get to DynamoDB, ok with a little bit of asynchronous writes, instead of writing directly to DynamoDB, you would have a SQS queue with a draining application that takes from SQS and writes into DynamoDB. With this, you can consistently have a smaller pipe going into Dynamo DB and have some level of backup within SQS and not worried about the scaling DynamoDB at those peak moments. Huge cost savings if your app can handle that latency…
DynamoDB Global Tables
- The idea of low latency global table is that a single write automatically handled by multi-regions, redundant, globally dispersed user-base.