Microsoft Power BI Cookbook
上QQ阅读APP看书,第一时间看更新

Remove duplicates

The objective of this example is to prevent refresh failures due to duplicate source values in the relationship column of a dimension table. Additionally, the duplicates are to be isolated for further inspection and troubleshooting:

  1. Access a dimension table query from an unstructured data source such as an Excel Workbook.
Figure 71: Product Query Preview
  • The source is an Excel table maintained by business users.
Figure 72: Excel Data Source
  • The Product Name column is used for the relationship to the Sales fact table; therefore it must uniquely identify each row.
Figure 73: Product to Sales Relationship
  • Given the one-to-many relationship, any duplicate values in the Product Name column of the Products table will result in refresh failures.
Figure 74: Duplicate Refresh Failure
  1. Add the following four M expressions to the Products query per Figure 75:
    • Remove any leading and trailing empty spaces in the Product Name column with a Text.Trim() function
    • Create a duplicate column of the Product Name key column with the Table.DuplicateColumn() function
    • Add an expression to the Products query with the Table.Distinct() function to remove duplicate rows
    • Add another Table.Distinct() expression to specifically remove duplicate values from the Product Name column
  • As an unstructured source, the column types were defined explicitly in the query via Table.TransformColumnTypes().
Figure 75: Duplicated key and distinct expressions
  • The query is still vulnerable to mixed cases such as Fender Set and Fender set.
Figure 76: Distinct cases in key values
  • The M engine considers the values unique but the data model engine doesn't.
  1. Add an expression to force uppercase on the Product Name column via the Table.TransformColumns() function. This new expression must be applied before the duplicate removal expressions are applied.
Figure 77: Uppercase expression inserted into the query
  • The query is now resilient to duplicate values and rows, mixed cases, and spaces. However, the Product Name column is now in the uppercase format.
Figure 78: All caps after duplicate removal
  1. Add two final expressions to replace the Product Name column with the duplicate column created in step 2.
Figure 79: Product Name Column Replacement
  • The capitalized Product Name column is dropped via Table.RemoveColumns(), and Table.RenameColumns() is used convert the duplicate column into the column loaded to the data model for the Product-to-Sales relationship.
  1. To support troubleshooting, create a query that accesses the same source table and retrieves the values from the Product Name column with more than one row.
Figure 80: 'Duplicate Products' query expression retrieves the Product Names with more than one row
  • The Product Name column is selected, grouped, and then filtered to always retrieve any duplicate key values. It also accounts for mixed casing of values.
Figure 81: The Duplicates Query
  1. Disable the load of the query retrieving duplicate product names--Duplicate Products in this example.