Normalization process of a data model

Normalization is a process of modelling data and organizing it into tables. The goal of normalization is a high quality data model.
The motivation for Normalization is to avoid unwanted redundancies in the table design, and eliminate modification anomalies.
There are use cases when you strongly need normalization (transaction processing db) and when you avoid it (business intelligence, data warehouse – since the update anomalies typically do not concern those and query performance is a priority), and also cases in between (when you are ok with having some joins in your query but not too many).

There’s evidence that it’s better to have normalization done as a refinement rather than the initial data model design.

The database will be normalized if you follow the modeling steps correctly, let’s sum them up briefly.

First let’s take care of some terminology

  • Functional dependency in DBMS – it’s a relationship between the attributes. It means that one or several attributes (called determinants – e.g. the Primary Key) determine the value of another attribute of an entity. An example would be your iPhone’s IMEI determining the phone attributes like model, manufacturer, screen size etc. Or – given the stock symbol + exchange id you know the volume, bid/ask etc. so the symbol and exchange combination is the determinant, and volume is functionally dependent on it.
    Note that the determinant is not always as unique as a key, it can have repeat values. An example would be “fee structure” functional dependency – “Hotel Room Type: Double” determining the price per night – but you can have many Double rooms in the hotel.
    There’s also a case when the dependency is multideterminational (aka multivalued), for example country code determines provinces. If you know the country code, you will know the list of provinces in it.
  • Candidate Key – an attribute, or a set of attributes that uniquely identifies the rest of the row attributes. An example is iPhone’s IMEI and Serial Number – both uniquely identifying the phone. only one of them may be selected as a PK – thus they are called Candidate Keys.
  • Super Key – a set of attributes that uniquely identify the rest of the row attributes. Similar to Candidate Key, but Superkey will have more than just the minimum set of attributes required to uniquely identify each row
    Example: Counterparty ID + Counterparty Legal Name + Counterparty Location superkey vs Conterparty ID (candidate key), Counterparty Legal Name (Candidate Key)

Normal forms: 1NF, 2NF, 3NF, DKNF

First Normal Form (1NF) is when a relation contains only single valued rows.

While the original definition by Codd was a bit free to interpretation, what this means in simple terms is that all rows should have the same number of columns, and each row/column intersection (attribute) should contain a single value, not a set of values.

An example below shows a non-1nf relation and then a fixed version of the model.
Interest rate swap with schedule dates in one field.

Before normalization:

IR Swap IDCashflow date
12342015-02-01, 2015-03-01, 2015-04-01
12352016-02-01, 2016-03-01

After normalization:

IR Swap IDCashflow date
12342015-02-01
12342015-03-01
12342015-04-01
12352016-02-01
12352016-03-01

Second Normal Form (2NF) is when it’s 1NF and all of the relation’s attributes have functional dependency on the WHOLE primary key.

If there’s an attribute that’s dependent on part of the key, the relation is not in 2NF.

An example: you have a Position table with security ID and portfolio ID as a PK, and Portfolio Head attribute – which depends only on the portfolio ID.

This is an anomaly since a change to the Portfolio Head will potentially cause consistency problems with the data, and you have to be careful to update all relevant rows.

Portfolio IDCUSIPPortfolio Head
PortfolioA912796WL9Anna
PortfolioA912810EN4Anna
PortfolioB912810EV6Julia

A better model is to externalize the portfolio information into a separate table.

Portfolio IDPortfolio Head
PortfolioAAnna
PortfolioAAnna
PortfolioBJulia

Third Normal Form (3NF) is when a relation is in 2NF and there are no columns that depend on a non-key column.

Example: you have a counterparty, with Legal Entity and Party ID as a PK, and attributes like Country of Incorporation, Country Currency. If a country changes its currency, you will have to carefully update many rows. Or if a country has more than one currency (e.g. CUC, Cuban convertible peso) you would need to create a second row for the same counterparty, just to keep the relation in the 1NF. A better solution is to create a separate “Country” table that has 1:M relationship with the counterparty table.

  • Non-3NF:
Legal EntityParty IDCountry of IncorporationCountry Currency
US NY123USAUSD
UK London345UKGBP
UK London678UKGBP
UK London678PanamaUSD
  • 3NF: Two tables, one for counterparty and one for country
Legal EntityParty IDCountry of Incorporation
US NY123USA
UK London345UK
UK London678UK
UK London678Panama
CountryCountry Currency
USAUSD
UKGBP
PanamaUSD

Boyce-Codd Normal Form (BCNF, aka 3.5NF) is a stronger version of 3NF.

This form was created when it became obvious that 3NF is not 100 perfect.

There’s a rare case when 3NF doesn’t work, when the table has several composite candidate keys, and their columns overlap, or when the relation describes a non-trivial entity relationship.

BCNF DEFINITION

Relation is in BCNF when for each functional dependency the determinant is the super key of the table (if X -> Y then X is a Super Key)
Or: BCNF means that every determinant is unique in the table. If there’s a determinant in an FD tha’s not unique – the relation is not in BCNF form.

Example: This table is in violation of BCNF. It has Trade ID as a PK, and if we review FDs we will spot the violation of BCNF.

Trade idExchangeProductCommodityDateTime
12345CMEWheat FuturesWheat
12346CMECorn FuturesCorn
12347CMEMilk OptionsMilk
12348LMECopper TAPOGrade A Copper
12349LMEAluminium FutureAl99.70 Aluminium
12350CMECorn FuturesCorn

FD in question: (Exchange, Product) -> Commodity
The determinant here is not a superkey since it has no Trade ID attribute (recall that PK is Trade ID)
And if we use the second definition we also find that it’s violated: (Exchange, Product) determinant is not unique – trades 123145 and 12350 have the same value for these attributes.

To bring this table to BNCF we would need to externalize Commodity into a new separate entity.

3 higher normal forms: 4NF, 5NF, anf DKNF

There are also 3 higher normal forms: 4NF, 5NF, anf DKNF.
They are less frequently used in business since it’s hard to find a use case tables that violate these forms.
The definition of these forms is below.

Fourth Normal Form

The relation is in 4NF if it there are no rows that contain two or more independent multivalues facts about an entity. If they do – the relation needs to be split.

Fifth Normal Form

The relation is in 5NF if every join dependency of the relation is a consequence of the candidate keys of the relation.

Domain key/Normal form (DKNF)

The relation is in domain key/normal form if each constraint on the relation is a logical consequence of the domain constraints and the key constraints of the relation.

In practice you wouldn’t always want to normalize your relations, and thus it leads to DW approaches like Star schema.

Star schema

Star schema is a dimensional structure that is characterized by having fact tables linked to associated dimensional tables linked by PK/FK relationships. Star schema offer performance benefits and query simplicity compared to the high levels of normalization, at the cost of update anomalies – but in data warehouses the rows are not typically updated, so it’s a fair trade-off. Additionally, createing highly normalized schema for the ETL jobs convenience may not be worth the time if the data needs to be denormalized at the DW level anyway.

Well-designed star schema can be extended easily to add more dimensions by simply adding FKs to the facts table; attributes can be added to the dimensions without affecting facts; fact attributes of the same granularity level can be added as new coulmns of the fact table etc.
As mentioned earlier, in data warehouse query convenience is more important than disk space, so redundancy is acceptable, and dimensions are denormalized to have redundancy attributes.

Snowflake schema

Snowflake schema is a further extension of the star schema, where dimension tables are normalized, creating characterisic ERD that looks like a snowflake.
Snowflake schemas are not very easy for users to understand and it does call for many joins in fairly simple business queries.