Question
I am trying to use Neo4j to build an MDM (master data management) system. For now I am modeling our customer database with properties such as email, documentNumber, address, phone, mobilephone and so on.
The problem is that our database is very dirty. For example, I have users with the same documentNumber (it is similar to an SSN), and when I look at these records I can see that they are actually the same person.
To discover patterns through relationships I need to dedup/clean the records, but I am afraid of losing information when I do.
First approach:
<customer>
<name>Maria da Silva</name>
<document>108518037-92</document>
<phone>
<areaCode>21</areaCode>
<number>2247223A</number>
</phone>
</customer>
<customer>
<name>Maria da S.</name>
<document>10851803792</document>
<phone>
<areaCode>21</areaCode>
<number>2247-2236</number>
</phone>
</customer>
So I could store the graph like this (in Cypher):
CREATE (person1:Person {name: "Maria da Silva", document: "108518037-92"})
CREATE (phone1:Phone {areaCode: "21", number: "2247223A"})
CREATE (person1)-[:OWNS]->(phone1)
CREATE (person2:Person {name: "Maria da S.", document: "10851803792"})
CREATE (phone2:Phone {areaCode: "21", number: "2247-2236"})
CREATE (person2)-[:OWNS]->(phone2)
And then I could create normalized/cleaned nodes:
CREATE (person_mdm:PersonMdm {name: "MARIA DA SILVA", document: "10851803792"}) // now I have to choose a name
CREATE (phone_mdm:PhoneMdm {areaCode: "21", number: "22472236"}) // and choose a phone too
And then link the original nodes to the normalized nodes:
CREATE (person_mdm)-[:REFERENCES]->(person1)
CREATE (person_mdm)-[:REFERENCES]->(person2)
CREATE (phone_mdm)-[:REFERENCES]->(phone1)
CREATE (phone_mdm)-[:REFERENCES]->(phone2)
CREATE (person_mdm)-[:OWNS]->(phone_mdm)
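The cleaned values on the PersonMdm and PhoneMdm nodes have to come from some normalization step. A minimal Python sketch of such a step follows. The function names are hypothetical, and the 8-digit phone length is an assumption taken from the example numbers. Note that a corrupt value such as "2247223A" is rejected rather than guessed, so picking "22472236" for it remains a manual or rule-based decision.

```python
import re
from typing import Optional


def normalize_document(raw: str) -> str:
    """Keep only the ASCII digits of a document number."""
    return re.sub(r"[^0-9]", "", raw)


def normalize_phone(raw: str, expected_digits: int = 8) -> Optional[str]:
    """Return the digits of a phone number, or None if the value looks corrupt.

    expected_digits=8 is an assumption based on the example data.
    """
    stripped = re.sub(r"[\s\-().]", "", raw)  # drop common separators only
    if re.fullmatch(r"[0-9]{%d}" % expected_digits, stripped):
        return stripped
    return None  # e.g. "2247223A": flag for review instead of guessing


def normalize_name(raw: str) -> str:
    """Upper-case and collapse whitespace; abbreviations are left untouched."""
    return " ".join(raw.upper().split())
```

Both document numbers from the example normalize to "10851803792", which is what lets the two Person nodes be linked to the same PersonMdm node.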
Second approach:
Store the MDM nodes with a list property holding hashes. Each hash references a record in another database (MongoDB, for example):
CREATE (person_mdm:PersonMdm {name: "MARIA DA SILVA", document: "10851803792", hash: ["XXX", "YYY"]})
CREATE (phone_mdm:PhoneMdm {areaCode: "21", number: "22472236", hash: ["ZZZ", "KKK"]})
CREATE (person_mdm)-[:OWNS]->(phone_mdm)
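How the hashes are produced is not specified here. One possible way, sketched in Python under that assumption, is to hash a canonical JSON form of each raw record and use the digest as the key in the external store. The names `record_hash` and `raw_store` are hypothetical, and the dict stands in for something like a MongoDB collection.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Return a stable SHA-256 hex digest for a raw source record."""
    # Sorted keys and fixed separators make the digest independent of key order.
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


raw_1 = {"name": "Maria da Silva", "document": "108518037-92",
         "phone": {"areaCode": "21", "number": "2247223A"}}
raw_2 = {"name": "Maria da S.", "document": "10851803792",
         "phone": {"areaCode": "21", "number": "2247-2236"}}

# In-memory stand-in for the external database, keyed by hash.
raw_store = {record_hash(r): r for r in (raw_1, raw_2)}

# The MDM node keeps only the cleaned values plus the hashes of its sources.
person_mdm = {"name": "MARIA DA SILVA", "document": "10851803792",
              "hash": sorted(raw_store)}
```

With this layout the original record can always be fetched again through `raw_store[h]` for any `h` in `person_mdm["hash"]`, at the cost of keeping the two stores in sync.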
First approach, pros and cons:
(+) It is simple to implement compared with the second approach
(+) I will have all nodes in a single database
(-) The number of nodes explodes
(-) Queries are more complex
Second approach, pros and cons:
(+) It is clean and simple to query
(-) The MDM information is split across two different databases
(-) Two separate databases must be maintained
Answer 1:
We typically go for the first approach, something along the lines of:
CREATE (person1:Person {name: "Maria da Silva", document: "108518037-92"})
CREATE (phone1:Phone {areaCode: "21", number: "2247223A"})
CREATE (person1)-[:OWNS]->(phone1)
CREATE (person2:Person {name: "Maria da S.", document: "10851803792"})
CREATE (phone2:Phone {areaCode: "21", number: "2247-2236"})
CREATE (person2)-[:OWNS]->(phone2)
CREATE (person1)-[:SAME_AS]->(person2)
I wouldn't worry about the number of nodes, as long as you don't have billions. Neo4j can handle a lot of nodes as they have a very small footprint.
Queries get a little more complicated, sure. But on the other hand, you have to do the cleanup/de-duplication somewhere, and doing it at query time ensures you don't lose any of the original information. It also gives you the flexibility to change/evolve the de-duplication logic, or even to have a different one per use case.
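To make the query-time idea concrete, here is a small Python sketch that mimics what a query over SAME_AS would do: start from one raw Person, follow SAME_AS links in either direction, and collect the phones owned by every duplicate. The dictionaries are a hypothetical in-memory stand-in for the graph, not a Neo4j API, and the helper names are made up for illustration.

```python
from collections import deque

# Raw nodes, stored exactly as they arrived (nothing is cleaned or merged).
persons = {
    "person1": {"name": "Maria da Silva", "document": "108518037-92"},
    "person2": {"name": "Maria da S.", "document": "10851803792"},
}
phones = {
    "phone1": {"areaCode": "21", "number": "2247223A"},
    "phone2": {"areaCode": "21", "number": "2247-2236"},
}
owns = {"person1": ["phone1"], "person2": ["phone2"]}
same_as = [("person1", "person2")]


def same_person_cluster(start: str) -> set:
    """Collect every person reachable from start over SAME_AS, ignoring direction."""
    neighbours = {}
    for a, b in same_as:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for other in neighbours.get(node, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


def phones_of(start: str) -> list:
    """Return the raw phone records owned by any duplicate of start."""
    cluster = sorted(same_person_cluster(start))
    return [phones[p] for person in cluster for p in owns.get(person, [])]
```

Calling `phones_of("person1")` or `phones_of("person2")` yields both raw phone records, while the stored nodes stay untouched. Any cleaning rule can then be applied to that result and changed later without rewriting data.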
Source: https://stackoverflow.com/questions/34120436/using-neo4j-to-build-a-master-data-management