Solution at hand –
1. Using UUID – Index size is a key consideration if uuid is used as index. Some UUID types are completely random and have no natural sort.
Pro – Each application thread generates IDs independently, minimizing points of failure and contention for ID generation. If you use a timestamp as the first component of the ID, the IDs remain time-sortable.
Cons – Generally requires more storage space (96 bits or higher) to make reasonable uniqueness guarantees. Some UUID types are completely random and have no natural sort.
2. Using a Ticket Server – This is one of the very famous approaches where you can simply maintain a table to store just the latest generated ID and every time a node asks for ID they make a ‘select for update’ on this table, update the value with a incremented value and use the selected value as the next ID.
This approach is resilient and distributed in nature. The ID generation can be separated from the actual data store. However there is a risk of Single Point of Failure as all the nodes rely on this table for the next ID and if this service goes down your app may stop functioning properly.
Additionally MySQL shards are built as master-master replicant pairs for resiliency. This means we need to be able to guarantee uniqueness within a shard in order to avoid key collisions. We’d love to go on using MySQL auto-incrementing columns for primary keys like everyone else, but MySQL can’t guarantee uniqueness across physical and logical databases.
Also this approach might not be suitable in case where the writes per second are very high because that will overload the Ticket Server and also degrade your app performance. Flickr Ticketing service
Can eventually become a write bottleneck (though Flickr reports that, even at huge scale, it’s not an issue).
An additional couple of machines (or EC2 instances) to admin.
If using a single DB, becomes single point of failure. If using multiple DBs, can no longer guarantee that they are sortable over time.
3. Twitter Snowflake –
Snowflake is a service used to generate unique IDs for objects within Twitter (Tweets, Direct Messages, Users, Collections, Lists etc.). These IDs are unique 64-bit unsigned integers, which are based on time, instead of being sequential. The full ID is composed of a timestamp, a worker number, and a sequence number.
This approach tackles the problem of SPOF as well as the latency issues.
– Here the ID is generated as a concatenation of timestamp, node ID and Sequence number. 41 bits are allotted to timestamp. This also allows the higher bit to be sorted and so allows somewhat sorted data.
– Node ID can be assigned to any physical node when during its startup and it can be retrieved from a shared cache in the cluster. Node ID can occupy next 10 bits. This number are coordinated by Zookeeper.
– The Sequence number can be a monotonically increasing 12 bit number.
Twitter has Snowflake service which is open source.
– Snowflake IDs are 64-bits, half the size of a UUID
– Can use time as first component and remain sortable
– Distributed system that can survive nodes dying
Would introduce additional complexity and more ‘moving parts’ (ZooKeeper, Snowflake servers) into our architecture
4. Instagram built there own stack. Before starting out, we listed out what features were essential in our system:
- * Generated IDs should be sortable by time
- IDs should ideally be 64 bits (for smaller indexes, and better storage in systems like Redis)
The basis for this is the initial bits(40) represent timestamp and the rest of the bit is formed based on other info like – node-id, machine-id.
Each of our IDs consists of:
– 41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
– 13 bits that represent the logical shard ID(can be used id )
– 10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs, per shard, per millisecond