The schema decisions made in week one create technical debt that compounds for years.
Standard data warehousing guidance — Kimball's dimensional modeling, star schemas, slowly changing dimensions — provides a solid foundation for logistics data warehouse design. But logistics data has specific characteristics that make some of the standard patterns more complicated than they appear in textbook examples. The SCD problem in WMS data, the granularity challenge in shipment fact tables, and the multi-system key management problem all have logistics-specific solutions that the general literature doesn't address. Getting them right at design time prevents the schema rewrites that typically happen 12-18 months into a logistics DW project when the first major discrepancy surfaces.
In logistics data warehousing, the granularity decision for the shipment fact table is more consequential than in most other domains, because shipments exist at multiple natural levels simultaneously:

- The shipment level: one customer's order moving to one destination, the unit of cost-per-shipment analysis.
- The load level: one carrier movement that may carry several shipments, the unit of carrier invoicing and performance measurement.
- The item level: the individual SKUs and quantities inside a shipment, the unit of inventory and landed-cost analysis.
- The leg level: one segment of a multi-stop or multi-leg movement, the unit of lane analysis.
The common mistake is trying to serve every use case with a single grain. Cost-per-shipment analysis needs the shipment level. Carrier performance analysis needs the load level. Inventory position analysis needs the item level. Lane analysis needs the leg level. A single-grain fact table that tries to serve all of these will either be unnecessarily granular (bloating storage and query cost) or insufficiently detailed (forcing aggregations that lose nuance).
The right design for most logistics DWs is a set of fact tables at different grains, with well-defined relationships between them: a SHIPMENT_FACT at the shipment level, a SHIPMENT_ITEM_FACT at the item level, and a LOAD_FACT at the load level. Each serves a different set of analytical questions. The cost of this design is more join complexity in BI tool configuration — but this is preferable to the alternative of a single fact table with structural compromises.
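The relationship between grains can be sketched in a few lines of Python. This is an illustrative sketch only: the table names come from the article, but the column names (shipment_id, extended_cost, total_cost) are assumptions, not a prescribed schema. The useful property of the multi-grain design is that the finer grain always rolls up exactly to the coarser one, which makes cross-grain consistency cheap to verify.

```python
# Illustrative rows at two grains. Column names are hypothetical.
SHIPMENT_ITEM_FACT = [
    # one row per item on a shipment (finest grain)
    {"shipment_id": "S1", "sku": "A", "qty": 10, "extended_cost": 50.0},
    {"shipment_id": "S1", "sku": "B", "qty": 2,  "extended_cost": 30.0},
    {"shipment_id": "S2", "sku": "A", "qty": 5,  "extended_cost": 25.0},
]

SHIPMENT_FACT = [
    # one row per shipment, carrying load_id as a foreign key to LOAD_FACT
    {"shipment_id": "S1", "load_id": "L9", "total_cost": 80.0},
    {"shipment_id": "S2", "load_id": "L9", "total_cost": 25.0},
]

def rollup_items_to_shipment(items):
    """Aggregate the item grain up to the shipment grain."""
    totals = {}
    for row in items:
        sid = row["shipment_id"]
        totals[sid] = totals.get(sid, 0.0) + row["extended_cost"]
    return totals

# Cheap cross-grain consistency check: every shipment-level total
# should equal the sum of its item-level rows.
rolled = rollup_items_to_shipment(SHIPMENT_ITEM_FACT)
for s in SHIPMENT_FACT:
    assert abs(rolled[s["shipment_id"]] - s["total_cost"]) < 0.01
```

A check like this, run after each load, catches the grain-drift bugs that otherwise surface as unexplained discrepancies between item-level and shipment-level reports.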
Slowly changing dimensions (SCDs) are a well-documented data warehousing challenge. In retail, the canonical example is a product changing its category or description — moderate frequency, relatively contained blast radius. In logistics, the SCD challenge is more severe for two reasons.
First, location data changes more frequently. Warehouse locations are assigned, reassigned, repurposed, and decommissioned regularly in active distribution centers. A slotting optimization project might reassign 30% of storage locations in a weekend. If your location dimension doesn't track these changes with proper SCD-2 versioning (effective dates, current flag), historical pick task records will join to the wrong location attributes — calculating travel distance metrics or putaway zone compliance for historical periods will produce incorrect results.
Second, carrier and rate information changes under contracts. Carrier lane rates, fuel surcharges, and accessorial fee schedules change when contracts renew — often annually, sometimes mid-year. If your carrier rate dimension uses SCD-1 (overwrite on change), historical freight cost calculations become incorrect after every rate update. For a business running $40M+ in annual freight, this matters significantly for any period-over-period cost analysis.
The recommendation: apply SCD-2 to location dimensions, carrier dimensions, and item dimensions with meaningful change cadences. Apply SCD-1 only to dimensions where historical accuracy genuinely doesn't matter — geographic reference data, for example. The operational overhead of SCD-2 is real but manageable; the analytical errors from SCD-1 on the wrong dimensions are recurring and expensive.
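The mechanics of an SCD-2 point-in-time join can be sketched briefly. The dimension rows, location IDs, and zone values below are hypothetical; the pattern is the standard one the article describes — each version carries effective dates and a current flag, and a historical fact joins to the version that was effective on the fact's event date, not the version that is current today.

```python
from datetime import date

# Hypothetical SCD-2 location dimension: the same physical location
# (A-01-01) has two versions after a slotting change on 2023-07-01.
LOCATION_DIM = [
    {"location_key": 101, "location_id": "A-01-01", "zone": "RESERVE",
     "effective_from": date(2022, 1, 1), "effective_to": date(2023, 6, 30),
     "is_current": False},
    {"location_key": 102, "location_id": "A-01-01", "zone": "FORWARD_PICK",
     "effective_from": date(2023, 7, 1), "effective_to": date(9999, 12, 31),
     "is_current": True},
]

def version_as_of(dim_rows, location_id, as_of):
    """Return the dimension version that was effective on the given date."""
    for row in dim_rows:
        if (row["location_id"] == location_id
                and row["effective_from"] <= as_of <= row["effective_to"]):
            return row
    return None

# A pick task from March 2023 joins to the RESERVE-era version of the
# location, so historical travel-distance and zone metrics stay correct.
historical = version_as_of(LOCATION_DIM, "A-01-01", date(2023, 3, 15))
```

With SCD-1, only the FORWARD_PICK row would exist, and the March 2023 pick task would silently inherit today's attributes — exactly the failure mode described above.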
A logistics data warehouse typically integrates data from 3-5 source systems with independent key spaces. A single shipment exists as:

- a shipment ID in the WMS,
- a sales order number in the ERP or order management system,
- a load ID in the TMS, and
- a PRO number in the carrier's system.
These four key types refer to the same physical event but are maintained independently by different systems. In a data warehouse that needs to answer questions like "what was the cost and on-time delivery rate for this customer's orders last quarter," you need a shipment identity table that bridges all four key types for every shipment.
This identity table is often called a golden record or entity resolution table. It's built by matching keys across systems using overlapping reference fields: order number present in both WMS and ERP, delivery date and ship-to location matching WMS shipment to carrier PRO number, etc. The match logic needs to handle the cases where keys don't perfectly align — partial matches, date discrepancies from timezone issues, or PRO numbers that were reassigned after a carrier billing correction.
For data quality monitoring, the match rate for this identity table is one of the most important metrics to track. A declining match rate (say, from 97% to 93%) indicates one of the source systems is changing its key generation or that a new integration pattern is producing unmatched records. As discussed in our article on orphaned shipment records, unmatched records in this table are the root cause of the "shipments without matching orders" problem that distorts cost-per-shipment calculations.
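The match-rate metric itself is simple to compute. The sketch below assumes hypothetical bridge-table field names (wms_shipment_id, erp_order_no, carrier_pro) and a hypothetical alert threshold; the point is only that a row counts as matched when every source-system key resolved.

```python
def match_rate(bridge_rows, key_fields):
    """Fraction of identity-table rows where every source-system key resolved."""
    matched = sum(1 for row in bridge_rows
                  if all(row.get(k) for k in key_fields))
    return matched / len(bridge_rows)

KEYS = ["wms_shipment_id", "erp_order_no", "carrier_pro"]

bridge = [
    {"wms_shipment_id": "S1", "erp_order_no": "O1", "carrier_pro": "P801"},
    {"wms_shipment_id": "S2", "erp_order_no": "O2", "carrier_pro": "P802"},
    {"wms_shipment_id": "S3", "erp_order_no": "O3", "carrier_pro": None},  # unmatched
    {"wms_shipment_id": "S4", "erp_order_no": "O4", "carrier_pro": "P804"},
]

rate = match_rate(bridge, KEYS)
ALERT_THRESHOLD = 0.95  # hypothetical; set against your own historical baseline
if rate < ALERT_THRESHOLD:
    print(f"identity-table match rate {rate:.1%} is below threshold")
```

Tracking this number daily, rather than discovering it during a quarterly reconciliation, is what turns a silent key-generation change in a source system into a same-day alert.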
Freight cost is one of the most analytically important fields in a logistics DW and one of the most difficult to model correctly. The challenge: freight cost is typically known at the load level (one carrier invoice for one load), but it needs to be attributed at the shipment level (each customer's orders) and sometimes at the item level (for landed cost calculations).
The standard allocation approach — distribute load cost across shipments by weight, cube, or flat split — sounds simple but has several complications:

- Accessorial charges such as detention, liftgate, and redelivery fees are often caused by a single shipment on the load; spreading them proportionally misattributes cost.
- Weight and cube values captured in the WMS are frequently estimated or missing, which skews any proportional split.
- Carriers issue corrections and rebills weeks after the original invoice, so allocations must be recomputable rather than one-time calculations.
- Minimum charges and multi-stop pricing mean load cost is not a linear function of any single allocation basis.
The practical recommendation: store the raw carrier invoice in a FREIGHT_INVOICE fact table at the invoice line level. Store the attributed cost in SHIPMENT_FACT as a derived field with a clear lineage to the allocation logic version used. When attribution logic changes (new contract terms, new carrier), record the version change so historical reports can be regenerated with the correct attribution for any period.
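A minimal weight-based allocation might look like the sketch below. The function name, field names, and the "weight-v1" version tag are illustrative assumptions; the two load-bearing details are real concerns from the text: each attributed row carries a lineage tag identifying the allocation-logic version, and rounding residue is pushed onto the last shipment so the parts always sum exactly to the invoice amount (working in cents avoids float drift).

```python
def allocate_load_cost(load_cost_cents, shipments, basis="weight_lbs"):
    """Allocate a load-level invoice cost across its shipments in
    proportion to `basis`, tagging each row with the allocation version."""
    total_basis = sum(s[basis] for s in shipments)
    allocated, out = 0, []
    for i, s in enumerate(shipments):
        if i == len(shipments) - 1:
            # last shipment absorbs the rounding residual, so the
            # allocated parts reconcile exactly to the carrier invoice
            share = load_cost_cents - allocated
        else:
            share = round(load_cost_cents * s[basis] / total_basis)
            allocated += share
        out.append({"shipment_id": s["shipment_id"],
                    "allocated_cents": share,
                    "allocation_version": "weight-v1"})  # lineage tag
    return out

# A $1,000.00 load invoice split across three shipments by weight.
parts = allocate_load_cost(100_000, [
    {"shipment_id": "S1", "weight_lbs": 6_000},
    {"shipment_id": "S2", "weight_lbs": 3_000},
    {"shipment_id": "S3", "weight_lbs": 1_000},
])
assert sum(p["allocated_cents"] for p in parts) == 100_000
```

When the allocation logic changes, a new version tag ("cube-v2", say) makes it possible to regenerate historical attributions with the logic that was in force for each period.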
The standard date dimension (year, month, quarter, day of week, holiday flag) covers most analytical needs but misses several logistics-specific date attributes that significantly affect analysis:

- A warehouse operating day flag: was the DC open and shipping on this date?
- A carrier pickup day flag: does the carrier serve this lane on this day of week?
- A weather event day flag: should this date be excluded from carrier SLA compliance?
- A peak season flag: do peak-season capacity constraints and surcharges apply?
Adding these attributes to the date dimension at design time is a few hours of work. Adding them retroactively after discovering that your carrier SLA compliance report is inflated because it doesn't exclude weather event days is a multi-week effort plus explaining the data revision to business stakeholders.
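As a concrete illustration of why these flags belong in the date dimension, here is a sketch of an SLA calculation that excludes weather event days. The dates, flags, and field names are made up for the example; the pattern is simply a filter against the date dimension before computing the on-time rate.

```python
from datetime import date

# Hypothetical slice of a date dimension with a weather_event flag.
DATE_DIM = {
    date(2024, 1, 15): {"weather_event": True},   # ice storm day
    date(2024, 1, 16): {"weather_event": False},
}

deliveries = [
    {"due": date(2024, 1, 15), "on_time": False},  # late, but on a weather day
    {"due": date(2024, 1, 16), "on_time": True},
    {"due": date(2024, 1, 16), "on_time": False},
]

def sla_compliance(rows, date_dim):
    """On-time rate over deliveries due on non-weather-event days."""
    scored = [r for r in rows if not date_dim[r["due"]]["weather_event"]]
    return sum(r["on_time"] for r in scored) / len(scored)

rate = sla_compliance(deliveries, DATE_DIM)
```

Without the flag, the late weather-day delivery is scored against the carrier and the compliance number is wrong; the report logic itself never needs to know which days were storms.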
A well-designed logistics data warehouse schema is characterized by explicit grain decisions (multiple fact tables at appropriate levels), SCD-2 implementation on dimensions with meaningful change cadences, a robust entity resolution table for cross-system key bridging, and logistics-specific date attributes in the time dimension. These decisions require more upfront design time than a simple star schema built from a single WMS export — but they produce a foundation that supports accurate analytics without recurring schema rewrites as the business evolves.
The teams that invest in this design thinking early consistently spend less time debugging data quality issues and more time using their logistics data to make actual operational decisions.
MLPipeLab delivers clean, normalized logistics data to Snowflake, BigQuery, or Redshift with a pre-designed schema appropriate for logistics analytics. Request a demo to see the target schema for your source systems.