Designing a Logistics Data Warehouse Schema That Doesn't Break Every Quarter

The schema decisions made in week one create technical debt that compounds for years.


Standard data warehousing guidance — Kimball's dimensional modeling, star schemas, slowly changing dimensions — provides a solid foundation for logistics data warehouse design. But logistics data has specific characteristics that make some of the standard patterns more complicated than they appear in textbook examples. The SCD problem in WMS data, the granularity challenge in shipment fact tables, and the multi-system key management problem all have logistics-specific solutions that the general literature doesn't address. Getting them right at design time prevents the schema rewrites that typically happen 12-18 months into a logistics DW project when the first major discrepancy surfaces.

The Central Design Decision: What Is Your Fact Table's Grain?

In logistics data warehousing, the granularity decision for the shipment fact table is more consequential than in most other domains, because shipments exist at multiple natural levels simultaneously:

  • The load level: a physical truck or container movement from origin to destination
  • The shipment level: an order fulfillment event for a specific customer, which may share a load with other shipments
  • The item level: individual SKUs within a shipment, with their own quantities, weights, and values
  • The leg level: for multi-stop or intermodal freight, individual transportation legs within a single shipment

The common mistake is forcing a single grain to serve all use cases. Cost-per-shipment analysis needs the shipment level. Carrier performance analysis needs the load level. Inventory position analysis needs the item level. Lane analysis needs the leg level. A single-grain fact table that tries to serve all of these will either be unnecessarily granular (bloating storage and query cost) or insufficiently detailed (forcing aggregations that lose nuance).

The right design for most logistics DWs is a set of fact tables at different grains, with well-defined relationships between them: a SHIPMENT_FACT at the shipment level, a SHIPMENT_ITEM_FACT at the item level, and a LOAD_FACT at the load level. Each serves a different set of analytical questions. The cost of this design is more join complexity in BI tool configuration — but this is preferable to the alternative of a single fact table with structural compromises.
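As a rough sketch of how these grains relate, here are the three fact tables as in-memory rows, with a roll-up check from item grain to shipment grain. Column names beyond the table names above are illustrative assumptions, not a prescribed schema:

```python
# Sketch: three fact tables at different grains, related by keys.
# Column names other than the table names from the text are assumptions.

LOAD_FACT = [  # grain: one row per physical truck/container movement
    {"load_id": "L-100", "carrier": "ACME", "linehaul_cost": 1800.0},
]

SHIPMENT_FACT = [  # grain: one row per customer shipment; FK to its load
    {"shipment_id": "S-1", "load_id": "L-100", "weight_lb": 6000},
    {"shipment_id": "S-2", "load_id": "L-100", "weight_lb": 4000},
]

SHIPMENT_ITEM_FACT = [  # grain: one row per SKU per shipment
    {"shipment_id": "S-1", "sku": "A", "qty": 10, "weight_lb": 2500},
    {"shipment_id": "S-1", "sku": "B", "qty": 4,  "weight_lb": 3500},
    {"shipment_id": "S-2", "sku": "A", "qty": 16, "weight_lb": 4000},
]

def check_grain_consistency(shipments, items):
    """Item-level weight should roll up exactly to shipment-level weight."""
    rollup = {}
    for row in items:
        rollup[row["shipment_id"]] = rollup.get(row["shipment_id"], 0) + row["weight_lb"]
    return all(rollup.get(s["shipment_id"], 0) == s["weight_lb"] for s in shipments)
```

A consistency check like this (item grain rolls up to shipment grain, shipment grain rolls up to load grain) is worth running in the load pipeline, because grain mismatches between the fact tables are exactly the kind of discrepancy that otherwise surfaces months later in reconciliation.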

The SCD Problem in Logistics: Worse Than Retail

Slowly changing dimensions (SCDs) are a well-documented data warehousing challenge. In retail, the canonical example is a product changing its category or description — moderate frequency, relatively contained blast radius. In logistics, the SCD challenge is more severe for two reasons.

First, location data changes more frequently. Warehouse locations are assigned, reassigned, repurposed, and decommissioned regularly in active distribution centers. A slotting optimization project might reassign 30% of storage locations in a weekend. If your location dimension doesn't track these changes with proper SCD-2 versioning (effective dates, current flag), historical pick task records will join to the wrong location attributes — calculating travel distance metrics or putaway zone compliance for historical periods will produce incorrect results.

Second, carrier and rate information changes on contract cycles. Carrier lane rates, fuel surcharges, and accessorial fee schedules change when contracts renew — often annually, sometimes mid-year. If your carrier rate dimension uses type-1 SCD (overwrite on change), historical freight cost calculations become incorrect after every rate update. For a business running $40M+ in annual freight, this matters significantly for any period-over-period cost analysis.

[Figure: SCD-2 implementation diagram]

The recommendation: apply SCD-2 to location dimensions, carrier dimensions, and item dimensions with meaningful change cadences. Apply SCD-1 only to dimensions where historical accuracy genuinely doesn't matter — geographic reference data, for example. The operational overhead of SCD-2 is real but manageable; the analytical errors from SCD-1 on the wrong dimensions are recurring and expensive.
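A minimal SCD-2 sketch for the location dimension, assuming the conventional effective_from / effective_to / is_current columns (common convention, not a prescribed standard):

```python
from datetime import date

# Sketch: SCD-2 versioning for a location dimension.
# Column names (effective_from, effective_to, is_current) are a common
# convention, assumed here for illustration.

def scd2_apply_change(dim_rows, natural_key, new_attrs, change_date):
    """Close the current version of `natural_key` and append a new one."""
    for row in dim_rows:
        if row["location_id"] == natural_key and row["is_current"]:
            row["effective_to"] = change_date
            row["is_current"] = False
    dim_rows.append({
        "location_id": natural_key,
        **new_attrs,
        "effective_from": change_date,
        "effective_to": None,  # open-ended current version
        "is_current": True,
    })

def attrs_as_of(dim_rows, natural_key, as_of):
    """Point-in-time lookup: the dimension version in effect on `as_of`."""
    for row in dim_rows:
        if (row["location_id"] == natural_key
                and row["effective_from"] <= as_of
                and (row["effective_to"] is None or as_of < row["effective_to"])):
            return row
    return None

# A slotting change reassigns a fast-pick location to reserve storage:
dim = [{"location_id": "A-01-01", "zone": "FAST_PICK",
        "effective_from": date(2023, 1, 1), "effective_to": None, "is_current": True}]
scd2_apply_change(dim, "A-01-01", {"zone": "RESERVE"}, date(2024, 6, 1))
```

The point-in-time lookup is what makes historical pick task records join to the location attributes that were in effect when the pick happened, rather than the current ones.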

The Multi-System Key Management Problem

A logistics data warehouse typically integrates data from 3-5 source systems with independent key spaces. A shipment exists as:

  • A shipment number in the WMS (warehouse-generated, resets at year end in some systems)
  • A delivery number in the ERP (SAP SD delivery number, Oracle outbound delivery)
  • A PRO number in the carrier's EDI feed (carrier-assigned, not WMS-controlled)
  • An order number in the order management system (customer-visible reference)

These four key types refer to the same physical event but are maintained independently by different systems. In a data warehouse that needs to answer questions like "what was the cost and on-time delivery rate for this customer's orders last quarter," you need a shipment identity table that bridges all four key types for every shipment.

This identity table is often called a golden record or entity resolution table. It's built by matching keys across systems using overlapping reference fields: order number present in both WMS and ERP, delivery date and ship-to location matching WMS shipment to carrier PRO number, etc. The match logic needs to handle the cases where keys don't perfectly align — partial matches, date discrepancies from timezone issues, or PRO numbers that were reassigned after a carrier billing correction.
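A minimal sketch of the bridge-building and match logic, showing only the order-number match path; the feed structures and field names here are illustrative assumptions:

```python
# Sketch: building the shipment identity (entity resolution) table by
# bridging WMS shipment numbers to ERP delivery numbers via a shared
# order number. Field names and sample data are assumptions.

wms_shipments = [
    {"wms_shipment_no": "W-9001", "order_no": "ORD-1", "ship_date": "2024-05-01"},
    {"wms_shipment_no": "W-9002", "order_no": "ORD-2", "ship_date": "2024-05-01"},
    {"wms_shipment_no": "W-9003", "order_no": None,    "ship_date": "2024-05-02"},
]
erp_deliveries = [
    {"delivery_no": "80001", "order_no": "ORD-1"},
    {"delivery_no": "80002", "order_no": "ORD-2"},
]

def build_identity_table(wms, erp):
    """One identity row per WMS shipment; None marks an unmatched record."""
    by_order = {d["order_no"]: d["delivery_no"] for d in erp}
    return [
        {"wms_shipment_no": s["wms_shipment_no"],
         "erp_delivery_no": by_order.get(s["order_no"])}
        for s in wms
    ]

def match_rate(identity):
    """Share of identity rows with a resolved ERP delivery number."""
    matched = sum(1 for r in identity if r["erp_delivery_no"] is not None)
    return matched / len(identity)

identity = build_identity_table(wms_shipments, erp_deliveries)
```

A real implementation would add the fallback match paths described above (ship date plus ship-to location for carrier PRO numbers, fuzzy handling of date discrepancies), but the structure is the same: every match path writes into one identity row per shipment.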

For data quality monitoring, the match rate of this identity table is one of the most important metrics to track. A declining match rate (say, from 97% to 93%) indicates that one of the source systems has changed its key generation logic or that a new integration pattern is producing unmatched records. As discussed in our article on orphaned shipment records, unmatched records in this table are the root cause of the "shipments without matching orders" problem that distorts cost-per-shipment calculations.

Handling Freight Cost Attribution

Freight cost is one of the most analytically important fields in a logistics DW and one of the most difficult to model correctly. The challenge: freight cost is typically known at the load level (one carrier invoice for one load), but it needs to be attributed at the shipment level (each customer's orders) and sometimes at the item level (for landed cost calculations).

The standard allocation approach — distribute load cost across shipments by weight, cube, or flat split — sounds simple but has several complications:

  • Multi-stop loads have partial cost attribution: shipments dropped at stop 1 should bear less cost than shipments delivered at stop 3, all else equal
  • Accessorial charges (liftgate, inside delivery, residential surcharge) are shipment-specific, not load-level, and need to be attributed separately
  • Carrier invoice errors and credits change the total cost after initial attribution — requiring retroactive reallocation of historical shipment costs

The practical recommendation: store the raw carrier invoice in a FREIGHT_INVOICE fact table at the invoice line level. Store the attributed cost in SHIPMENT_FACT as a derived field with a clear lineage to the allocation logic version used. When attribution logic changes (new contract terms, new carrier), record the version change so historical reports can be regenerated with the correct attribution for any period.
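The basic allocation step can be sketched as follows. The weight-based split comes from the text; the rounding policy, version tag format, and function names are assumptions:

```python
# Sketch: weight-based allocation of a load-level linehaul cost to its
# shipments, with shipment-specific accessorials kept separate and an
# allocation-version tag recorded for lineage. Details are assumptions.

def allocate_load_cost(linehaul_cost, shipments, accessorials, version="wt-v1"):
    """shipments: {shipment_id: weight}; accessorials: {shipment_id: charge}."""
    total_weight = sum(shipments.values())
    allocated = {}
    for shipment_id, weight in shipments.items():
        share = round(linehaul_cost * weight / total_weight, 2)
        allocated[shipment_id] = {
            "linehaul_alloc": share,       # derived field in SHIPMENT_FACT
            "accessorial": accessorials.get(shipment_id, 0.0),
            "alloc_version": version,      # lineage to the allocation logic
        }
    return allocated

costs = allocate_load_cost(
    1800.0,                        # one carrier invoice for one load
    {"S-1": 6000, "S-2": 4000},    # two shipments sharing the load, by weight
    {"S-2": 75.0},                 # liftgate charge, shipment-specific
)
```

The alloc_version field is what makes retroactive reallocation tractable: when a carrier credit or a new contract changes the logic, historical rows can be regenerated under the version that was correct for their period.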

Date Dimension Considerations for Logistics

The standard date dimension (year, month, quarter, day of week, holiday flag) covers most analytical needs but misses several logistics-specific date attributes that significantly affect analysis:

  • Business days in transit — carrier transit time calculations use business days, not calendar days. Carrier-specific holiday schedules differ from US federal holidays.
  • Peak season flags — retail-adjacent logistics has pronounced seasonality. Marking Black Friday week, pre-Christmas peak, and Q4 generally allows easy peak vs. non-peak comparisons.
  • Carrier network disruption days — severe weather events, labor actions, and major outages affect carrier performance metrics for those days and skew SLA compliance calculations if not flagged.

Adding these attributes to the date dimension at design time is a few hours of work. Adding them retroactively after discovering that your carrier SLA compliance report is inflated because it doesn't exclude weather event days is a multi-week effort plus explaining the data revision to business stakeholders.
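As an illustration of the first point, a business-days-in-transit calculation against a carrier-specific holiday set. The holiday calendar and the day-counting convention (ship day excluded, delivery day included) are assumptions for the sketch:

```python
from datetime import date, timedelta

# Sketch: business days in transit using a carrier-specific holiday
# calendar. The holiday set below is illustrative, not a real schedule.

CARRIER_HOLIDAYS = {date(2024, 7, 4)}  # assumed carrier non-service day

def business_days_in_transit(ship_date, delivery_date, holidays=CARRIER_HOLIDAYS):
    """Count weekdays between ship and delivery, excluding carrier holidays.
    Ship day excluded, delivery day included (one common convention)."""
    days = 0
    d = ship_date
    while d < delivery_date:
        d += timedelta(days=1)
        if d.weekday() < 5 and d not in holidays:
            days += 1
    return days
```

For example, a shipment that leaves Wednesday 2024-07-03 and delivers Monday 2024-07-08 spent two business days in transit once the July 4 holiday and the weekend are excluded; a calendar-day calculation would report five and quietly inflate the carrier's apparent transit times.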

Conclusion

A well-designed logistics data warehouse schema is characterized by explicit grain decisions (multiple fact tables at appropriate levels), SCD-2 implementation on dimensions with meaningful change cadences, a robust entity resolution table for cross-system key bridging, and logistics-specific date attributes in the time dimension. These decisions require more upfront design time than a simple star schema built from a single WMS export — but they produce a foundation that supports accurate analytics without recurring schema rewrites as the business evolves.

The teams that invest in this design thinking early consistently spend less time debugging data quality issues and more time using their logistics data to make actual operational decisions.

MLPipeLab delivers clean, normalized logistics data to Snowflake, BigQuery, or Redshift with a pre-designed schema appropriate for logistics analytics. Request a demo to see the target schema for your source systems.
