Configure Entity Values for Bronze Layer Entities in NCC¶

Learn how to configure Entity Values for Bronze layer entities in NCC to control data quality and parse JSON data. This article covers deduplication strategies and JSON handling with the Collection and DateFormat Entity Values.

Introduction¶

Bronze entities in NCC can be customized with Entity Values to support a variety of scenarios. For a complete list of available Entity Values, see Entity Values Reference.

Manage Data Quality¶

Control duplicate data during source file ingestion by configuring deduplication options for Bronze layer entities.

Deduplication Modes¶

Set the DQ_Deduplication_Mode parameter to define the deduplication strategy. Supported modes include:

None
All rows from the source file are loaded without deduplication.
Row
Only unique rows are retained based on the entire row content.
Key
Deduplication is performed using primary key column(s). Only the first occurrence of each key is kept.

TIP
Choose the deduplication mode that best matches your data quality requirements and the structure of your source data.

Figure: Deduplication modes illustration

Deduplication Mode Examples¶

None¶

Source file:
name	age	height
Alice	5	80
Alice	5	80
Alice	10	80

Loaded into Bronze:
name	age	height
Alice	5	80
Alice	5	80
Alice	10	80

All rows are loaded, including duplicates.

Row¶

Source file:
name	age	height
Alice	5	80
Alice	5	80
Alice	10	80

Loaded into Bronze:
name	age	height
Alice	5	80
Alice	10	80

Only unique rows are retained.

Key (Name is the key column)¶

Source file:
name	age	height
Alice	5	80
Alice	5	80
Alice	10	80

Loaded into Bronze:
name	age	height
Alice	5	80

Only the first row for each key value is kept.

IMPORTANT
Before enabling the Key option, verify that your primary key columns are correct. If the key does not uniquely identify a record, rows that share a key value are discarded and data extracted from the source system is lost.
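The three modes can be sketched in plain Python using the example rows above. This is only an illustration of the behavior described in this section, not NCC's implementation; the `deduplicate` function and its parameters are hypothetical names chosen for the example.

```python
from typing import Dict, Iterable, List, Sequence

Row = Dict[str, object]


def deduplicate(rows: Iterable[Row], mode: str = "None",
                key_columns: Sequence[str] = ()) -> List[Row]:
    """Hypothetical stand-in for DQ_Deduplication_Mode (illustration only)."""
    rows = list(rows)
    if mode == "None":
        # Every source row is loaded as-is.
        return rows
    if mode == "Row":
        # Two rows are duplicates when all of their column values match.
        def identity(row: Row):
            return tuple(sorted(row.items()))
    elif mode == "Key":
        if not key_columns:
            raise ValueError("Key mode needs at least one key column")

        # Two rows are duplicates when their key column values match.
        def identity(row: Row):
            return tuple(row[column] for column in key_columns)
    else:
        raise ValueError(f"Unknown deduplication mode: {mode}")

    seen = set()
    kept = []
    for row in rows:
        marker = identity(row)
        if marker not in seen:  # keep only the first occurrence
            seen.add(marker)
            kept.append(row)
    return kept


source = [
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 10, "height": 80},
]

for mode, keys in [("None", []), ("Row", []), ("Key", ["name"])]:
    print(mode, deduplicate(source, mode, keys))
```

Note how Key mode with `name` as the key drops the row with age 10, which is the data loss the IMPORTANT note warns about.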

For more details on Bronze layer entity configuration, see NCC documentation.

Parse JSON Data with Entity Values¶

Entity Values enable parsing and transforming JSON data in Microsoft Fabric. This section describes how to use the Collection and DateFormat Entity Values for effective data processing.

Collection Entity Value¶

Use the Collection Entity Value to specify an array in the JSON of a Landing zone entity. An array is an ordered collection of values, enclosed in square brackets ([]) with the values separated by commas.

Example JSON array:

["2024-01-01T13:00:00.000", "2024-02-01T13:00:00.000"]

To access collection members, use the collection name, followed by a period (.), and then the member name.

Figure: Example JSON Bronze Layer Entity with Collection

TIP
Use backticks ( ` ) to separate the collection and its members.
Example: `billingAddress`.`street`

If the period is wrapped in backticks, the path references a nested dictionary key:

{"@odata": {"context": "http://services.odata.org"}}

Otherwise, it references a key with a period in its name:

{"@odata.context": "http://services.odata.org"}
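The difference between the two forms can be sketched with a small path resolver. This is an illustration only, assuming the fully quoted form used in the column mapping examples of this article (each name in backticks, joined by periods); `split_path` and `lookup` are hypothetical helpers, not NCC functions.

```python
import re
from typing import Any, List


def split_path(path: str) -> List[str]:
    """Split a source path into the keys it addresses.

    Assumption for this sketch: backtick-quoted segments are separate
    nested keys, and a path without backticks is one literal key, even
    if it contains a period.
    """
    quoted = re.findall(r"`([^`]*)`", path)
    return quoted if quoted else [path]


def lookup(document: dict, path: str) -> Any:
    """Walk the document one key at a time."""
    value = document
    for key in split_path(path):
        value = value[key]
    return value


nested = {"@odata": {"context": "http://services.odata.org"}}
dotted = {"@odata.context": "http://services.odata.org"}

print(lookup(nested, "`@odata`.`context`"))  # nested dictionary key
print(lookup(dotted, "@odata.context"))      # key with a period in its name
```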

Data Processing with `collection` and `column_mapping`¶

Configure the collection and column_mapping settings to process JSON files using PySpark. These settings help transform nested JSON structures into a flat DataFrame.

The collection field specifies the JSON arrays to explode into rows, so that each array element becomes its own row and the data is stored at a finer level of granularity.
For multiple arrays:
- Nested arrays: separate with a semicolon (;)
- Non-nested arrays: use different Bronze entities from the same Landing zone entity

Example JSON:

{
    "id": 1,
    "name": "John Doe",
    "orders": [
        {
            "order_id": 101,
            "amount": 250
        },
        {
            "order_id": 102,
            "amount": 450
        }
    ]
}

To explode the orders array, set collection to orders.
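The effect of exploding an array can be sketched in plain Python. NCC performs this step with PySpark; the `explode` helper below is a hypothetical stand-in that only mirrors the idea of producing one row per array element while repeating the parent fields.

```python
from typing import Any, Dict, List


def explode(record: Dict[str, Any], collection: str) -> List[Dict[str, Any]]:
    """Return one row per element of record[collection] (illustration only).

    Parent fields are copied to every row, and the array is replaced by
    the single element that the row represents.
    """
    rows = []
    for element in record.get(collection, []):
        row = dict(record)         # shallow copy of the parent fields
        row[collection] = element  # one array element per row
        rows.append(row)
    return rows


customer = {
    "id": 1,
    "name": "John Doe",
    "orders": [
        {"order_id": 101, "amount": 250},
        {"order_id": 102, "amount": 450},
    ],
}

for row in explode(customer, "orders"):
    print(row)
```

The single customer record becomes two rows, one per order, each still carrying `id` and `name`.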

Column Mapping¶

The column_mapping field defines how JSON fields map to DataFrame or Delta table columns. Use it to rename columns or to select specific fields. Each target column name must be unique.

Example JSON:

{
    "id": 1,
    "name": "John Doe",
    "billingAddress": {
        "street": "123 Main St",
        "city": "Anytown"
    },
    "shippingAddress": {
        "street": "123 Main St",
        "city": "Anytown"
    }
}

Resulting column mapping:

[
    {"source": "id", "target": "user_id"},
    {"source": "name", "target": "full_name"},
    {"source": "`billingAddress`.`street`", "target": "billingAddress_street"},
    {"source": "`billingAddress`.`city`", "target": "billingAddress_city"},
    {"source": "`shippingAddress`.`street`", "target": "shippingAddress_street"},
    {"source": "`shippingAddress`.`city`", "target": "shippingAddress_city"}
]
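As a rough illustration of what this mapping produces, the sketch below applies it to the example JSON in plain Python. The `apply_column_mapping` helper is hypothetical and only mimics the described behavior: backtick-quoted names are followed as nested keys, and each target becomes one flat column.

```python
import re
from typing import Any, Dict, List


def apply_column_mapping(record: Dict[str, Any],
                         column_mapping: List[Dict[str, str]]) -> Dict[str, Any]:
    """Flatten one JSON record into target columns (illustration only)."""
    row: Dict[str, Any] = {}
    for entry in column_mapping:
        # Backtick-quoted names are nested keys; otherwise one literal key.
        keys = re.findall(r"`([^`]*)`", entry["source"]) or [entry["source"]]
        value: Any = record
        for key in keys:
            value = value[key]
        if entry["target"] in row:
            raise ValueError(f"Duplicate target column: {entry['target']}")
        row[entry["target"]] = value
    return row


record = {
    "id": 1,
    "name": "John Doe",
    "billingAddress": {"street": "123 Main St", "city": "Anytown"},
    "shippingAddress": {"street": "123 Main St", "city": "Anytown"},
}

column_mapping = [
    {"source": "id", "target": "user_id"},
    {"source": "name", "target": "full_name"},
    {"source": "`billingAddress`.`street`", "target": "billingAddress_street"},
    {"source": "`billingAddress`.`city`", "target": "billingAddress_city"},
    {"source": "`shippingAddress`.`street`", "target": "shippingAddress_street"},
    {"source": "`shippingAddress`.`city`", "target": "shippingAddress_city"},
]

flat_row = apply_column_mapping(record, column_mapping)
print(flat_row)
```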

Nested arrays example:

{
    "id": 1,
    "name": "John Doe",
    "orders": [
        {
            "order_id": 101,
            "amount": 250,
            "items": [
                {"item_id": 1, "product": "Book"},
                {"item_id": 2, "product": "Pen"}
            ]
        },
        {
            "order_id": 102,
            "amount": 450,
            "items": [
                {"item_id": 3, "product": "Notebook"}
            ]
        }
    ]
}

Resulting column mapping:

[
    {"source": "id", "target": "user_id"},
    {"source": "name", "target": "full_name"},
    {"source": "`orders`.`order_id`", "target": "order_id"},
    {"source": "`orders`.`amount`", "target": "order_amount"},
    {"source": "`orders`.`items`.`item_id`", "target": "item_id"},
    {"source": "`orders`.`items`.`product`", "target": "product_name"}
]

Configuration example:

collection = "orders;orders.items"
column_mapping = [
    {"source": "id", "target": "user_id"},
    {"source": "name", "target": "full_name"},
    {"source": "`orders`.`order_id`", "target": "order_id"},
    {"source": "`orders`.`amount`", "target": "order_amount"},
    {"source": "`orders`.`items`.`item_id`", "target": "item_id"},
    {"source": "`orders`.`items`.`product`", "target": "product_name"}
]

NOTE
Every path in the collection field must reference an array in your JSON structure.
Every source in column_mapping must match a field in the JSON, and every target must be a unique column name in the DataFrame.
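To show how the two settings interact, the sketch below explodes the nested arrays from the example in order and then applies a column mapping. It is a plain-Python approximation, not NCC's PySpark code. All function names are hypothetical, and two assumptions are made: collection paths are split on periods, and mapping sources use the backtick-quoted form.

```python
import re
from typing import Any, Dict, List


def get_path(row: Dict[str, Any], keys: List[str]) -> Any:
    """Follow a list of keys into nested dictionaries."""
    value: Any = row
    for key in keys:
        value = value[key]
    return value


def set_path(row: Dict[str, Any], keys: List[str], value: Any) -> Dict[str, Any]:
    """Return a copy of row with the value at the key path replaced."""
    copy = dict(row)
    if len(keys) == 1:
        copy[keys[0]] = value
    else:
        copy[keys[0]] = set_path(row[keys[0]], keys[1:], value)
    return copy


def flatten(record: Dict[str, Any], collection: str,
            column_mapping: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    rows = [record]
    # Explode each array in order: outer array first, then the nested one.
    for path in collection.split(";"):
        keys = path.split(".")
        # A row whose array is empty produces no output rows in this sketch.
        rows = [set_path(row, keys, element)
                for row in rows
                for element in get_path(row, keys)]
    # Pick and rename the fields listed in the mapping.
    return [
        {m["target"]: get_path(
            row, re.findall(r"`([^`]*)`", m["source"]) or [m["source"]])
         for m in column_mapping}
        for row in rows
    ]


record = {
    "id": 1,
    "name": "John Doe",
    "orders": [
        {"order_id": 101, "amount": 250,
         "items": [{"item_id": 1, "product": "Book"},
                   {"item_id": 2, "product": "Pen"}]},
        {"order_id": 102, "amount": 450,
         "items": [{"item_id": 3, "product": "Notebook"}]},
    ],
}

column_mapping = [
    {"source": "id", "target": "user_id"},
    {"source": "name", "target": "full_name"},
    {"source": "`orders`.`order_id`", "target": "order_id"},
    {"source": "`orders`.`amount`", "target": "order_amount"},
    {"source": "`orders`.`items`.`item_id`", "target": "item_id"},
    {"source": "`orders`.`items`.`product`", "target": "product_name"},
]

flat_rows = flatten(record, "orders;orders.items", column_mapping)
for row in flat_rows:
    print(row)
```

The result has one row per item, so the order and customer fields repeat on every item row.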

DateFormat Entity Value¶

The DateFormat Entity Value allows you to specify a custom date format for all datetime columns in a JSON entity.

Figure: Example JSON Bronze Layer Entity with DateFormat

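By default, dates use the ISO-8601 format. You can change this to another format, such as "dd-MM-yyyy HH:mm:ss", to match regional preferences. If the value is interpreted as a Spark datetime pattern, the letters are case-sensitive: MM is the month and mm is the minute, while HH is the hour on a 24-hour clock and hh is the hour on a 12-hour clock.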

NOTE
The specified format applies to all datetime columns in the JSON entity.
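The difference between the default ISO-8601 form and a custom day-first format can be illustrated with Python's standard library. This is only a sketch of the two text forms. Python uses its own `%` directives rather than the pattern letters accepted by DateFormat, and the `PYTHON_FORMAT` constant is a name chosen for this example.

```python
from datetime import datetime

# ISO-8601 value, as in the JSON array example earlier in this article.
ISO_VALUE = "2024-01-01T13:00:00.000"

# Python equivalent of a day-first format with a 24-hour clock:
# %m is the month, %M is the minute, %H is the 24-hour clock.
PYTHON_FORMAT = "%d-%m-%Y %H:%M:%S"

parsed = datetime.fromisoformat(ISO_VALUE)     # default ISO-8601 parsing
custom_text = parsed.strftime(PYTHON_FORMAT)   # same instant, custom format
round_trip = datetime.strptime(custom_text, PYTHON_FORMAT)

print(custom_text)
```

A value written in the custom format can only be read back correctly when the same format is supplied, which is why the format must match the data in every datetime column of the entity.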