ids

This document describes approach for generaiting a short, human friendly and unique entities identifiers without collisions.

Rationale

API-resource IDs should not be easy predictable because of the following reasons:

Even though the proper permissions validation needs to be implemented to control API resource access, in case of a vulnerability a predictable ID format allows to leverage the vulnarebility for a much bigger set of resources via IDs enumeration. If by having set of IDs it's hard to predict what other IDs could have been in the system the vulnerability's impact gets descreased.
Sequential integer IDs allow to estimate number of entities in the system which also may be beneficial for the competitors.

In order to address the challeges listed above a radnom ID generation may be used like UUID or just a random string of a certain length in the given alphabet.

But leveraging this approach requires IDs length to be increased in order to avoid collisions during ID generation. If there are $N$ possible IDs then the probability $P$ to not meet a collision during $K$ generations will be calculated as following:

$$P=\frac{N(N-1)...(N-K+1)}{N^K}$$

This follows from the following fact: there are $N^K$ of total possible generation results. From the other side to generate $K$ IDs without collisions there is $N$ choices (from $N$ IDs) for the first generation, $N-1$ choices for the second generation (now we need to choice from $N-1$ IDs because $1$ had been allocated on the previous step and cannot be used in the next generations), $N-2$ choices for the third generation (now $2$ of $N$ IDs had been allocated on the previous steps) and so forth. Therefore the total amount of possible results of sequential $K$ generations without collisions will be: $$N(N-1)...(N-K+1)=\frac{N!}{(N-K)!}$$

In fact this is exactly well known Birthday Problem and the probability formula has approximation:

$$\mathrm{e}^{-\frac{K^2}{2N}}$$

The probability to meet at least one collision during $K$ generations is $1-P$ therefore it's approximation is:

$$1-\mathrm{e}^{-\frac{K^2}{2N}}$$

Approximation significantly simplifies this probability calculation for the big values of $N$ and $K$.

In order to make IDs easily readable and writable by humans the ID alphabet must be limited to exclude the characters than may seem pretty similar in some fonts and thus may cause confusion. Numbers $0..9$ and upper case english letters (except $O$, $I$, $L$, $J$) meets the human friendly criteria:

$$0123456789ABCDEFGHKMNPQRSTUVWXYZ$$

It has $32$ symbols. If $L$ denotes ID length then there are $32^L$ possible words (IDs) of length $L$ in that alphabet.

If an entity representation is stored in a PostgreSQL table with an integer primary key there are not more than $K=2^{31}-1=2147483647$ (see PostgreSQL data types for more details).

From the above follows, for $12$-character IDs the probability to meet a collision during $K=2^{31}-1$ generations will be: $86$% which is pretty high. For $L=13$ the probability is $6$% which may be pretty reasonable for a real use case.

From the other side to represent any of $2^{31}-1$ IDs in a $32$ characters alphabet it's enough to have just $7$ characters length ($2^{31}-1 \lt 32^7 = 2^{35}$) for ID.

This means the ID length needs to be almost doubled in order to avoid collisions during random generations which makes such approach not very human friendly because a 7-characters IDs is obviously easier to type instead of a 13-characters ID.

Bellow an alternative approach is proposed for IDs generation. This approach is collisions free and therefore does not require excess characters so only $7$ characters length is enough in the above context. It also has an additonal benefit: it may be applied to an existing data without data migration.

Algorithm

A sequence of numbers $f(n)$ defined as $f(n) = c \wedge f(n+1) = f(n) + 1$ (corresponds to the PostgreSQL SERIAL/BIGSERIAL data types) obviously does not have collisions. Max value of the PostgreSQL SERIAL data type is $2^{31}-1$ which has length of 10 digits in the decimal notation. In order to reduce the length (for the human friendly requirement) it can be encoded in the 32-characters alphabet defined above as a 7-characters word (6-characters length provides only $32^6=(2^5)^6=2^{30} \lt 2^{31}-1$ which would be not enough). But this would have been violating the requirement of not being predictable for the IDs.

Applying for a sequence a bijective function $R: 1..N \rightarrow 1..N$ that distributes the values of $1..N$ in same range randomly (or pseudo-randomly) would allow to cover the requirement of not being predictable and keep the benefit of not having collisions.

One example of such a function is RSA-encryption. Let's denote $N$ as the total amount of IDs that is considered as enough, $L$ - is the ID length required to encode any ID in range $1..N$ in an alphabet of $X$ characters. Then, in notations of the article, if we chose $n = pq > N \wedge n \le X^L$, then the encryption function will be the bijective function $R: 1..n-1 \rightarrow 1..n-1$.

Let $encode$ will be the function that encodes a number from range $1..2^{31}-1$ in the 32-characters alphabet defined above. Then the ID length in that alphabet is $7$. Let $encrypt$ will be the RSA-encryption function, $decode$, $decrypt$ will be the functions reverse to the functions $encode$, $encrypt$ respectively, $next() \rightarrow 1..N$ the function that produces the sequential numbers.

Then ID generation algorithm is the following: $id = encode(encrypt(next()))$. In order to get the original value $v$ returned by function $next$ there needs to be calculated: $decrypt(decode(id))$.

Consequence 1: This approach may be applied to an existing records in a database withput any DDL: IDs may be calculated "on the fly" during API calls without storing the encrypted IDs in a database.

Note 1: Consequence 1 is applicable only if there is no need to do a prefix search for IDs. In that case because of the randomization IDs would be better to be stored in a database explicitly.

Note 2: The approach proposed uses RSA encryption but this is not the only possible option. Other asymmetric encryptions could also work. The symmetric algorithms (DES, AES) use a certain block sizes (48 bit and more) and because of that could not provide the necessary short ID length ($\le$ 40 bit for not more than 8-characters ID length for int32 primary keys). But may be there are some other symmetric algorithms that can also be suitable for the IDs generation.

Implementation

Python implementation example is provided in this file. It should be executed as a command line script (python src/genid.py) and provides the following subcommands:

inc -- generate RSA-encrypted sequence numbers (IDs)
key -- print possible $(e, d)$ key-paris
primes -- list prime numbers in the given range
rng -- generate random 14-characters IDs, can be used to compare generation performance

This approach can also we applied to an existing database data quite easily: either use Consequence 1, or store the IDs generated in a column, this folder together with Justfile (see Justfile docs for more details about the tool) provides examples for that, to run them execute the following commands:

Run a PostgreSQL container docker compose up -dDuring startup it will create 2 tables in the main database with 10k rows in each. The table demo_labtest is used for integer PK table example, the table demo_assesement is used for uuid PK table example.

Apply database migrations just migrateIt will provide the necessary columns and functions.

Insert more data just insert demo_labtest just insert demo_assesementCheck results just psqlSelect table rows to check.

Note 3: There are several options to choose public and private keys $e, d$. The function $encrypt(x) = x^e \mod{n}$ is used during ID generation when a new record gets inserted. The function $decrypt(y) = y^d \mod{n}$ is used if the ID generated when a record has been inserted is not stored in a database. In such case if the write load is higher than the read load the public key $e$ can be a smaller number, if the read load is higher then the private key $d$ should be a smaller value. If the IDs are sotred in the database then decryption is not required and public key $e$ can be choosen from a small values.

vbogretsov/README.md

ids

Rationale

Algorithm

Implementation

Example