Identity Crisis: How Modern Applications Generate Unique Ids | by Eric Elliott | JavaScript Scene | Dec, 2022

TL;DR: Use Cuid2 so you don’t need to worry about all this identifier complexity.

It’s 7:15 am and customer support is swamped. We just got featured on Good Morning America, and a whole bunch of first time customers are bumping into bugs. A customer would fill a shopping cart with items, check out, pay for the items, and then be presented with an order receipt with all the wrong items!

What was happening? We were using GUIDs, and we trusted them not to collide. After all, they’re called “Globally Unique Identifiers” so they must be unique, right? The odds of collision for GUIDs are supposed to be one in 5.316912e+36. That’s a gigantic entropy space. Entropy is a measure of the total information in a system. In the context of unique ids, a higher entropy will lead to fewer collisions, and can also make it more difficult for an attacker to guess a valid id. If you’re still using UUID/GUID or database identifiers as your primary record keys, it’s time to re-evaluate.

That incident led me to create the Cuid specification, which has been ported to more than 20 languages and used in thousands of applications, some of which have hundreds of millions of users. But in the last decade of using it, I realized I was optimizing for the wrong things. Before we go into that, though, we need a little context.

Applications store data that needs to be uniquely identified so that it can be referenced and looked up. To do that, we create unique identifiers for the data.

One solution is to use database auto-increment ids, but this solution falls apart the moment you outgrow the data you can store on a single database host. Different hosts will generate ids that are already in use on other hosts. Successful modern applications scale to hundreds of millions or billions of users, each of whom will generate hundreds of individual records which each need a unique identifier. Those apps require lots of servers to handle the demand.

To solve this issue, Instagram used a distributed database system to generate unique ids. This allowed for horizontal scalability, but it has some downsides:

  • Setup and maintenance are complex.
  • It’s slow — when a client needs to create a new record, first it needs to request a new id from the distributed id generator, adding an extra round-trip to every record creation.
  • It doesn’t work at all offline. In order to create any new record complete with an id, the client must be connected to the internet.

Another solution is to use UUIDs or GUIDs. But UUIDs and GUIDs have their own set of issues. They’re large, they waste space on dashes, they can’t be used as identifiers in many languages (making them incompatible as keys in many data structures) and most importantly, in spite of their strong-sounding names and theoretically remote odds of colliding, they often generate duplicate ids. Due to a poor pseudo-random algorithm, some older implementations of V4 UUID can’t generate more than 10k ids without generating a collision.

Twitter Snowflake addresses these issues by using a 64-bit integer id, which is composed of a timestamp, a worker id, and a sequence number. Snowflake makes the assumption that the id-generating hosts are arranged neatly in uniquely identified workers, but that can’t be the case if you want to generate ids in a client without a round-trip.
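To make the structure concrete, here is a sketch of a Snowflake-style id packed into a 64-bit integer. The 41/10/12-bit layout and the custom epoch match common descriptions of Twitter Snowflake, but treat the details as assumptions for illustration, not a faithful implementation:

```javascript
// Snowflake-style id: 41-bit timestamp, 10-bit worker id, 12-bit
// sequence, packed into a 64-bit BigInt.
const EPOCH = 1288834974657n; // Twitter's custom epoch (assumed here)

function snowflake(timestampMs, workerId, sequence) {
  const ts = BigInt(timestampMs) - EPOCH;
  return (ts << 22n) | (BigInt(workerId) << 12n) | BigInt(sequence);
}

// Decompose an id back into its parts to show the structure.
function snowflakeParts(id) {
  return {
    timestampMs: Number((id >> 22n) + EPOCH),
    workerId: Number((id >> 12n) & 0x3ffn), // low 10 bits of the middle field
    sequence: Number(id & 0xfffn),          // low 12 bits
  };
}
```

Note that the worker id is baked into every id, which is exactly why this scheme needs uniquely numbered, server-side workers.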

Our ideal solution must be:

  • Secure
  • Collision resistant
  • Horizontally scalable
  • Offline-compatible
  • Identifier compatible
  • Fast and convenient — Id generation should not require a network request or asynchronous entropy

Identifier security is more important than ever, and Cuid did not go far enough.

  • Ids should not leak any information about the data they identify
  • Ids should not leak any information about the host that generated the id
  • Ids should not leak the time that the data was recorded
  • It should not be feasible to guess another valid id given samples of other valid ids

If you go searching the web for different identifier standards, you may find impressive performance charts talking about the virtues of sequential ids for database performance. 10 years ago, I thought that was more important than leaking a timestamp, host fingerprint, or generation sequence. I was wrong.

Today’s applications are much more likely to protect valuable digital assets, money, access to e-commerce accounts, private data, and so on. We are entering the era of the internet-of-value, and that means we need to step up security. It doesn’t matter if a 60ms server round trip could have been 30ms with a sequential keyed lookup if the person looking up your private data is a hacker.

Id collisions are problematic because they can cause applications to:

  • Leak other people’s private data to the wrong users
  • Overwrite or lose important data
  • Confuse or misinform users by presenting incorrect information
  • Allow attackers to guess valid identifiers by exploiting collision statistics
  • Perform poorly if systems have to constantly check for and handle collisions

A good strategy to avoid collisions is to include a variety of entropy sources. This works because collisions are more likely to occur when all the entropy sources used to generate the unique identifiers are correlated or have low entropy (few possible values). Some good sources of entropy include:

  • The current system time
  • Pseudorandom values
  • A session counter
  • A host fingerprint
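A minimal sketch of combining those four sources into a single id might look like the following. This is not the Cuid2 algorithm itself, and the fingerprint here is a hypothetical stand-in; it only illustrates the principle of mixing uncorrelated entropy:

```javascript
// Sketch: combine time, pseudorandomness, a session counter, and a
// host fingerprint into one id string. Illustrative only.
let counter = 0;
// Hypothetical stand-in for a real host fingerprint.
const HOST_FINGERPRINT = Math.random().toString(36).slice(2, 6);

function simpleId() {
  const time = Date.now().toString(36);          // current system time
  const count = (counter++).toString(36);        // session counter
  const random = Math.random().toString(36).slice(2, 10); // pseudorandom
  return time + count + HOST_FINGERPRINT + random;
}
```

Even if two of the sources collide (same millisecond, same random value), the counter still differentiates ids generated on the same host.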

Each of these is weak on its own. It’s possible for many ids to be generated in the same millisecond in popular applications. Pseudorandom values are prone to collisions due to poor random number distributions, and not just in somebody’s home-made random() function.

Those bugs existed for many years in PHP and JavaScript. If you think your favorite programming language is better, don’t be so sure: I’ve seen bugs in just about every popular programming language’s main random function. Lots of the bugs are fixed now, but it’s not a good idea to take chances. If you can add more entropy than a pure pseudorandom generator gives you, you absolutely should.

Many id solutions just prefix a random value with the current system time, but that actually doesn’t improve entropy much because most pseudorandom generators use the system time as a seed. You need more non-correlated entropy. A session counter helps by incrementing a counter each time a new id is generated, guaranteeing some entropy change with each id generated on a single host. I added it to the original Cuid to prevent collisions when many ids are generated in the same millisecond on the same host.

There are two more entropy problems. First, a single session counter does not supply much entropy, and it’s easy to predict subsequent ids if that’s all that changes between one id and the next. Second, if you generate lots of ids across many different hosts, you still may get collisions because the session counters could overlap with each other across different hosts. In other words, when many different users are using the app on many different devices at the same time, two users could still generate the same id at the same time — unless we add a fingerprint that differentiates one user’s device from another user’s device. That’s the host fingerprint.

Cuid2’s JavaScript implementation of the fingerprint is particularly strong. It uses a very long string of global variable names, hashing them to mix their entropy with a random salt and each of the other entropy sources using a process that is impossible to reverse. Imagine throwing a bunch of different bits of fruit into a blender and mixing them together. Once that’s done, it’s impossible to extract just the banana. Cuid2 makes an entropy smoothie from all the entropy sources.

A good id generator should produce ids with a uniform random distribution. In other words, when you plot the output on a graph, it should look like randomly distributed static noise: not too smooth, and with no visible patterns or structures. Here’s a randogram for Cuid2:

Another test for uniform random distribution is a histogram chart. To generate one, produce lots of ids, convert them into numbers, divide the total entropy range into equally sized buckets, and sort each id into a bucket based on its numeric value. The result should be fairly uniform, but a little bumpy across the top. There should not be any obvious large peaks or gaps in the histogram. Here’s a histogram plot of Cuid2:
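The histogram test described above can be sketched in a few lines. Here Math.random() stands in for the numbers derived from ids, and the 20% tolerance is an arbitrary choice for illustration:

```javascript
// Bucket many sample values into equal-width buckets, then check
// that no bucket is wildly over- or under-filled.
function histogram(samples, buckets) {
  const counts = new Array(buckets).fill(0);
  for (let i = 0; i < samples; i++) {
    counts[Math.floor(Math.random() * buckets)]++;
  }
  return counts;
}

function looksUniform(counts, tolerance = 0.2) {
  const expected = counts.reduce((a, b) => a + b, 0) / counts.length;
  // Every bucket should be within `tolerance` of the expected count:
  // "a little bumpy across the top" but no large peaks or gaps.
  return counts.every((c) => Math.abs(c - expected) / expected <= tolerance);
}
```

With enough samples, a healthy generator passes easily; a biased one shows up as a bucket far from the expected count.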

Modern applications need unique IDs that can scale to large user bases without colliding, handle multiple servers, and be used offline. And they must be secure.

  • UUIDs and GUIDs have issues with size, formatting, language compatibility, and poor randomization leading to collisions and security issues.
  • Server-managed ids cause slow round-trips and make offline operation complicated.
  • The original Cuid specification addresses many of these issues, but it does not do a good enough job at security.
  • The successor, Cuid2, addresses these issues by combining many entropy sources using a tiny, but well-distributed, one-way hash function. The hashing step prevents the ids from leaking information that might be used by hackers to attack the system or users’ private data.
  • Cuid2 is a secure, collision-resistant, horizontally scalable, offline-compatible solution for generating unique IDs.

And just like the original Cuid, it’s tiny. Including the hash function, Cuid2 weighs in at less than 5k, gzipped.
