Matching rows between two columns to get an exact or partial match

Submitted by ぐ巨炮叔叔 on 2020-04-18 12:37:07

Question


I'm having trouble writing a query in Power BI that matches two rows partially or completely. I also want to calculate the match percentage when a match is found and put the result in a new column.

The actual dataset contains a lot of rows and tables but for the sake of this example I’m using only 4 columns.

The values in the 'ID' and 'Text' columns are unique. The values in 'KI ID' and 'KI Text' are also unique, and they are related to 'ID' and 'Text' only when a match occurs.

What I need to implement is the following:

I would like to compare each row in the 'Text' column with each row in the 'KI Text' column. If there is a match, I would like to know the 'KI ID' and the match percentage. Take a look at the data set for better insight.

PS: Is this actually achievable with Power Query, or is it just a fantasy? From my perspective, I think I'm heading towards machine learning.

Data set https://drive.google.com/open?id=1JrsxPa6DICNi5N-tI5ESukh62W_uedaQ


The match calculation is based on the number of words that occur in both columns, 'Text' and 'KI Text'. For example, suppose one of the rows in the 'Text' column contains two sentences, and these sentences match part of a 'KI Text' row that has 6 sentences in total. The match between the rows is partial, so it should be calculated as 2/6, which is a 33.3% match.

In addition, the 'KI Text' column contains a lot of rows that could possibly match one of the 'Text' column rows. The result should be shown only if the match is greater than or equal to 80%; otherwise it's not interesting.


Answer 1:


Hopefully, this is what you're looking for, or at least brings you closer to an answer.

While I believe your ID / Text and KI ID / KI Text data most likely come from two different tables, you presented them in one spreadsheet as your data set, so I started with that. The only content available to me in your spreadsheet was Sheet2. I simply copied and pasted the contents of your spreadsheet's Sheet2 to my own spreadsheet in Excel. I then used the default table name of Table1 to refer to it. I brought it in just like you presented it.

Then, still working within the same query, I separated it into two new tables: an ID table and a KI ID table. You can make the first table by selecting the existing ID and Text columns, then Home > Remove Columns > Remove Other Columns; but to make the second table, you would need to use the formula bar.

Before I made the second table, I added a column to the first table (the ID table) with the number 1 in every row. I called that column ID Match Key.

Then I copied the first table's formula from the applied step where I had created it, to use it in the formula bar to create the second table for KI ID. I edited it, changing the column names as appropriate for KI ID.

Then I added a column with the number 1 in every row to the second table (the KI ID table). I called that column KI ID Match Key.

Then I did a full outer merge of the two tables that I had just made, the ID and KI ID tables. To do this, I first used Home > Merge Queries and merged Table1 with itself. (Which column to use for a match doesn't matter, because this is temporary.) I selected Full Outer as the Join Kind. Once the merge was done, I edited the merge in the formula bar to change the tables to #"Added ID Match Key" and #"Added KI ID Match Key" (the names of those two "tables" after I created them and added their match keys for this merge) and the respective matching fields to "ID Match Key" and "KI ID Match Key". Because every row carries the same key value, 1, this pairs each ID row with every KI ID row.
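
The constant-key merge described above can be tried in isolation. This is only a minimal sketch: the two #table literals and the step names IdTable, KiTable, IdWithKey, KiWithKey, Merged and Expanded are made up for illustration and are not part of the original query.

```powerquery
let
    // Hypothetical sample tables standing in for the ID and KI ID tables
    IdTable = #table({"ID", "Text"}, {{"A1", "first text"}, {"A2", "second text"}}),
    KiTable = #table({"KI ID", "KI Text"}, {{"K1", "first ki text"}, {"K2", "second ki text"}, {"K3", "third ki text"}}),
    // The same constant in every row of both tables, so every row matches every row
    IdWithKey = Table.AddColumn(IdTable, "ID Match Key", each 1),
    KiWithKey = Table.AddColumn(KiTable, "KI ID Match Key", each 1),
    Merged = Table.NestedJoin(IdWithKey, {"ID Match Key"}, KiWithKey, {"KI ID Match Key"}, "KI Table", JoinKind.FullOuter),
    Expanded = Table.ExpandTableColumn(Merged, "KI Table", {"KI ID", "KI Text"})
in
    Expanded  // 2 x 3 = 6 rows, one per (ID, KI ID) pair
```

With real data the row count is the product of the two table sizes, so it may be worth checking how large that gets before running it on the full dataset.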

After the merge, I expanded the resultant column.

Then I replaced all nulls in both the Text and KI Text columns with blank text. I selected both columns (by clicking one column, then holding Ctrl and clicking the other), clicked Transform > Replace Values, typed null in the "Value To Find" box, left the "Replace With" box blank, and clicked OK.

Then I added columns to split each of the Text and KI Text column cells into lists of their words. Before I did each split, though, I first filtered out everything except lowercase and capital letters a through z, numbers 0 through 9, spaces, and equals signs. I used Text.SplitAny to split the words at any spaces or equals signs.
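
To see what the filter-then-split step does to a single cell, here is a small sketch. The sample string is invented; the Text.Select and Text.SplitAny calls mirror the ones in the full query.

```powerquery
let
    // Hypothetical sample cell value
    Sample = "Error: code=42, retry later!",
    // Keep only letters, digits, spaces, and equals signs
    Cleaned = Text.Select(Sample, {"a".."z", "A".."Z", "0".."9", " ", "="}),
    // Split at every space or equals sign
    Words = Text.SplitAny(Cleaned, " =")
in
    Words
```

Two things that may be worth checking against your data: as far as I know, consecutive separators leave empty strings in the list, which then count as words (something like List.Select(Words, each _ <> "") would drop them), and the later comparison is case-sensitive by default, so wrapping the text in Text.Lower before splitting could be an option if "Error" and "error" should match.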

Then I added a column and determined what words from Text were in KI Text. I used List.Intersect to do that. It lists the duplicates, which is something I wanted.

Then I added a column and did the reverse: I determined what words from KI Text were in Text. Again, I used List.Intersect.
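
As a rough illustration of the List.Intersect step on its own, assuming two made-up word lists (TextWords and KiWords are hypothetical names, not columns from the query):

```powerquery
let
    // Hypothetical word lists, as they might look after the split step
    TextWords = {"reset", "the", "password"},
    KiWords = {"how", "to", "reset", "the", "user", "password"},
    // Words that appear in both lists
    Common = List.Intersect({KiWords, TextWords})
in
    Common  // contains the three shared words: reset, the, password
```

This example deliberately has no repeated words; how repeats are handled is worth testing separately on your own data.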

Then I collected count information. I created a new column with the count of the number of words (all occurrences) in each of the lists I had created. I did this once each for Text, KI Text, Text in KI Text, and KI Text in Text. I simply used List.Count to do this. This wasn't actually necessary to do as separate steps with separate columns. I just did it to clearly see the numbers. The M code I generated in the two columns that follow these four columns doesn't use these four columns. I did the counting in those next two columns as well.

Next, I created two new columns to calculate the % of Text in KI Text and the % of KI Text in Text. I used these basic formulas (presented here in, hopefully, clear English, versus the M code):

  • % Text in KI Text = (Number of words from Text that are in KI Text / Number of words in KI Text) * 100
  • % KI Text in Text = (Number of words from KI Text that are in Text / Number of words in Text) * 100

    Be sure to check my math.
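
One way to check the math is to run the two formulas on a tiny example where the answer is easy to work out by hand. The word lists below are invented for illustration; only the formulas follow the ones above.

```powerquery
let
    // Hypothetical word lists: 3 words in Text, 6 in KI Text, 3 in common
    TextWords = {"reset", "the", "password"},
    KiWords = {"how", "to", "reset", "the", "user", "password"},
    Common = List.Intersect({KiWords, TextWords}),
    // (words from Text found in KI Text / words in KI Text) * 100 = 3 / 6 * 100
    PctTextInKi = Number.Round(List.Count(Common) / List.Count(KiWords) * 100, 2),
    // (words from KI Text found in Text / words in Text) * 100 = 3 / 3 * 100
    PctKiInText = Number.Round(List.Count(Common) / List.Count(TextWords) * 100, 2)
in
    [#"% Text in KI Text" = PctTextInKi, #"% KI Text in Text" = PctKiInText]  // 50 and 100
```

The two numbers differ because each formula divides by a different list length, which is why the query keeps both columns.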

Last, I removed all the columns I no longer wanted.

After your initial three comments, I added some steps to my query that build a combined Outcome column for each ID.

Here's the M code with the edits that I made after your three comments. See my comments for more context:

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", type text}, {"Text", type text}, {"KI ID", type text}, {"KI Text", type text}, {"Outcome", type text}}),
    #"Made ID Table" = Table.SelectColumns(#"Changed Type",{"ID", "Text"}),
    #"Added ID Match Key" = Table.AddColumn(#"Made ID Table", "ID Match Key", each 1),
    #"Made KI ID Table" = Table.SelectColumns(#"Changed Type",{"KI ID", "KI Text"}),
    #"Added KI ID Match Key" = Table.AddColumn(#"Made KI ID Table", "KI ID Match Key", each 1),
    #"Merged Queries" = Table.NestedJoin(#"Added ID Match Key", {"ID Match Key"}, #"Added KI ID Match Key", {"KI ID Match Key"}, "KI Table", JoinKind.FullOuter),
    #"Expanded KI Table" = Table.ExpandTableColumn(#"Merged Queries", "KI Table", {"KI ID Match Key", "KI ID", "KI Text"}, {"KI ID Match Key", "KI ID", "KI Text"}),
    #"Replaced Value" = Table.ReplaceValue(#"Expanded KI Table",null,"",Replacer.ReplaceValue,{"Text", "KI Text"}),
    #"Split Text to List" = Table.AddColumn(#"Replaced Value", "Text Listed", each Text.SplitAny(Text.Select([Text],{"a".."z","A".."Z","0".."9"," ","="})," =")),
    #"Split KI Text to List" = Table.AddColumn(#"Split Text to List", "KI Text Listed", each Text.SplitAny(Text.Select([KI Text],{"a".."z","A".."Z","0".."9"," ","="})," =")),
    #"Got Text in KI Text" = Table.AddColumn(#"Split KI Text to List", "Text in KI Text", each List.Intersect({[KI Text Listed],[Text Listed]})),
    #"Got KI Text in Text" = Table.AddColumn(#"Got Text in KI Text", "KI Text in Text", each List.Intersect({[Text Listed], [KI Text Listed]})),
    //You don't actually need the next four lines. I included them so you can see the numbers in the table. They are calculated in the two #"Calculated..." lines below.
    //If you choose to remove them, you'll need to replace #"Got Count of KI Text in Text", in the first #"Calculated..." line, with #"Got KI Text in Text".
    #"Got Count of Text" = Table.AddColumn(#"Got KI Text in Text", "Text Count", each List.Count([Text Listed])),
    #"Got Count of KI Text" = Table.AddColumn(#"Got Count of Text", "KI Text Count", each List.Count([KI Text Listed])),
    #"Got Count of Text in KI Text" = Table.AddColumn(#"Got Count of KI Text", "Text in KI Text Count", each List.Count([Text in KI Text])),
    #"Got Count of KI Text in Text" = Table.AddColumn(#"Got Count of Text in KI Text", "KI Text in Text Count", each List.Count([KI Text in Text])),
    #"Calculated % Text in KI Text" = Table.AddColumn(#"Got Count of KI Text in Text", " % Text in KI Text", each Number.Round((List.Count([Text in KI Text])/List.Count([KI Text Listed]))*100, 2), type number),
    #"Calculated % KI Text in Text" = Table.AddColumn(#"Calculated % Text in KI Text", "% KI Text in Text", each Number.Round((List.Count([KI Text in Text])/List.Count([Text Listed]))*100, 2), type number),
    #"Removed Other Columns" = Table.SelectColumns(#"Calculated % KI Text in Text",{"ID", "Text", "KI ID", "KI Text", " % Text in KI Text", "% KI Text in Text"}),
    //
    //What follows is my edit after your three comments.
    #"Added Custom" = Table.AddColumn(#"Removed Other Columns", "Outcome", each Text.From([#" % Text in KI Text"]) & "% match with " & [KI ID]),
    #"Grouped Rows" = Table.Group(#"Added Custom", {"ID"}, {{"AllData", each _, type table [ID=text, Text=text, KI ID=text, KI Text=text, #" % Text in KI Text"=number, #"% KI Text in Text"=number, Outcome=text]}}),
    #"Added Custom1" = Table.AddColumn(#"Grouped Rows", "Text", each [AllData][Text]{0}),
    #"Added Custom2" = Table.AddColumn(#"Added Custom1", "Outcome", each [AllData][Outcome]),
    #"Extracted Values" = Table.TransformColumns(#"Added Custom2", {"Outcome", each Text.Combine(List.Transform(_, Text.From), "#(cr)"), type text}),
    #"Removed Other Columns1" = Table.SelectColumns(#"Extracted Values",{"ID", "Text", "Outcome"})
in
    #"Removed Other Columns1"
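
The query above lists every pair, while the question asks to show only matches of 80% or more. A possible sketch of that filter, assuming it would be applied to the percentage column before the Outcome text is built; the Scored table here is a made-up stand-in for the real intermediate table, and Kept is a hypothetical step name.

```powerquery
let
    // Hypothetical stand-in for the table produced by the #"Removed Other Columns" step
    Scored = #table(
        type table [ID = text, #"KI ID" = text, #" % Text in KI Text" = number],
        {{"A1", "K1", 85.71}, {"A1", "K2", 33.33}, {"A2", "K3", 80}}
    ),
    // Keep only the pairs that reach the 80% threshold from the question
    Kept = Table.SelectRows(Scored, each [#" % Text in KI Text"] >= 80)
in
    Kept  // the A1/K1 and A2/K3 rows remain
```

Note that the column name starts with a space in the original query, so the field reference has to include it.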


Source: https://stackoverflow.com/questions/60617568/matching-rows-between-two-columns-to-get-an-exact-or-partial-match