Using Neo4j to build a Master Data Management

Question

I am trying to use Neo4j to build an MDM system. For now I am just modeling our customer database with a few properties, such as email, documentNumber, address, phone, mobilephone and so on.

The problem is that our database is very dirty. For example, I have users with the same documentNumber (it is like an SSN), and when I look at these records I can see that they are actually the same person.

To discover patterns through relationships I need to dedup/clean the records, but I am afraid of losing information when I dedup them.

First approach:

<customer>
    <name>Maria da Silva</name>
    <document>108518037-92</document>
    <phone>
        <areaCode>21</areaCode>
        <number>2247223A</number>
    </phone>
</customer>

<customer>
    <name>Maria da S.</name>
    <document>10851803792</document>
    <phone>
        <areaCode>21</areaCode>
        <number>2247-2236</number>
    </phone>
</customer>
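The two records above differ mostly in formatting, so a first cleaning step could be sketched as below. This is only a minimal illustration in Python: the function names `normalize_document` and `normalize_phone` are hypothetical, and the rule "strip separators, then require digits only" is an assumption, not something the real data is known to satisfy.

```python
import re
from typing import Optional


def normalize_document(raw: str) -> str:
    """Keep only the digits of a document number (assumed rule)."""
    return re.sub(r"\D", "", raw)


def normalize_phone(raw: str) -> Optional[str]:
    """Strip common separators; return None if anything but digits remains."""
    cleaned = re.sub(r"[\s\-.()]", "", raw)
    return cleaned if cleaned.isdigit() else None
```

With these rules both document values collapse to the same key, while the phone number containing a letter is flagged (returned as None) rather than silently "fixed", so a person or a later rule still has to decide what it should be.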

So I could store the graph (using "Cypher"):

CREATE (person1:Person {name: "Maria da Silva", document: "108518037-92"})
CREATE (phone1:Phone {areaCode: "21", number: "2247223A"})
CREATE (person1)-[:owns]->(phone1)

CREATE (person2:Person {name: "Maria da S.", document: "10851803792"})
CREATE (phone2:Phone {areaCode: "21", number: "2247-2236"})
CREATE (person2)-[:owns]->(phone2)

And then I could create normalized/cleaned nodes:

CREATE (person_mdm:PersonMdm {name: "MARIA DA SILVA", document: "10851803792"}) // now I have to choose a name
CREATE (phone_mdm:PhoneMdm {areaCode: "21", number: "22472236"}) // and choose a phone too

and then link the original nodes to the normalized nodes:

CREATE (person_mdm)-[:references]->(person1)
CREATE (person_mdm)-[:references]->(person2)

CREATE (phone_mdm)-[:references]->(phone1)
CREATE (phone_mdm)-[:references]->(phone2)
CREATE (person_mdm)-[:owns]->(phone_mdm)
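The step of choosing a name and linking back to the originals can be sketched outside the database as follows. This is a rough Python illustration, not a recommended implementation: `SourceRecord`, `GoldenRecord` and `build_golden` are hypothetical names, and the survivorship rule used here (the longest upper-cased name wins) is just an assumption for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class SourceRecord:
    record_id: str
    name: str
    document: str


@dataclass
class GoldenRecord:
    name: str
    document: str
    # Ids of the untouched source records this golden record references.
    references: List[str] = field(default_factory=list)


def build_golden(records: Iterable[SourceRecord]) -> Dict[str, GoldenRecord]:
    """Group source records by digits-only document and pick one name per group."""
    golden: Dict[str, GoldenRecord] = {}
    for rec in records:
        key = "".join(ch for ch in rec.document if ch.isdigit())
        candidate = rec.name.upper()
        current = golden.get(key)
        if current is None:
            golden[key] = GoldenRecord(candidate, key, [rec.record_id])
        else:
            current.references.append(rec.record_id)
            # Assumed survivorship rule: prefer the longest name seen so far.
            if len(candidate) > len(current.name):
                current.name = candidate
    return golden
```

Because the golden record only holds references, the source records are never modified, which is the property that keeps the original information from being lost.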

Second approach:

Store the MDM nodes with a list property holding hashes. Each hash references a record in another database (MongoDB, for example):

CREATE (person_mdm:PersonMdm {name: "MARIA DA SILVA", document: "10851803792", hash: ["XXX", "YYY"]})
CREATE (phone_mdm:PhoneMdm {areaCode: "21", number: "22472236", hash: ["ZZZ", "KKK"]})
CREATE (person_mdm)-[:owns]->(phone_mdm)
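How such hashes could be produced is not specified above; one possible sketch is a content hash over a canonical JSON form of the raw record. Everything here is an assumption for illustration: `record_hash` and `store_raw` are hypothetical helpers, SHA-256 is an arbitrary choice, and the `raw_store` dict merely stands in for the external database (no MongoDB API is used).

```python
import hashlib
import json
from typing import Dict

# Stand-in for the external store; a real setup would use a database.
raw_store: Dict[str, dict] = {}


def record_hash(record: dict) -> str:
    """Hash a record via its canonical JSON form (sorted keys, no extra spaces)."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def store_raw(record: dict) -> str:
    """Save the untouched record under its hash and return the hash."""
    digest = record_hash(record)
    raw_store[digest] = record
    return digest
```

The returned strings are what would go into the `hash` list on the MDM node. Since the key is derived from the content, storing the same raw record twice yields the same hash instead of a new entry.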

First approach:

(+) It's simpler to implement than the second approach

(+) I will have all nodes in a single database

(-) Explosion in the number of nodes

(-) Queries are more complex

Second approach:

(+) It is clean and simple to query

(-) The MDM information is stored in two different databases (maintenance)

(-) Must maintain two separate databases

Answer 1:

We typically go for the first approach. Something along the lines of

CREATE (person1:Person {name: "Maria da Silva", document: "108518037-92"})
CREATE (phone1:Phone {areaCode: "21", number: "2247223A"})
CREATE (person1)-[:OWNS]->(phone1)

CREATE (person2:Person {name: "Maria da S.", document: "10851803792"})
CREATE (phone2:Phone {areaCode: "21", number: "2247-2236"})
CREATE (person2)-[:OWNS]->(phone2)

CREATE (person1)-[:SAME_AS]->(person2)

I wouldn't worry about the number of nodes, as long as you don't have billions. Neo4j can handle a lot of nodes as they have a very small footprint.

Queries get a little more complicated, sure. But on the other hand, you have to do the cleanup/de-duplication somewhere, and doing that at query time ensures you don't lose any of the original information. It also gives you the flexibility to change/evolve the de-duplication logic, or even have a different one per use case.
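To make "de-duplication at query time" concrete, here is a small Python sketch of the logic such a query would have to express: follow SAME_AS links in both directions to collect every record of the same person, then gather what any of them owns. The function names and the in-memory edge list are hypothetical; in Neo4j itself this traversal would be written as a graph query rather than in application code.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def same_as_cluster(start: str, same_as: Iterable[Tuple[str, str]]) -> Set[str]:
    """All records reachable from start over SAME_AS links, ignoring direction."""
    neighbours = defaultdict(set)
    for a, b in same_as:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first traversal
        node = queue.popleft()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen


def phones_of(
    start: str,
    same_as: Iterable[Tuple[str, str]],
    owns: Dict[str, List[str]],
) -> List[str]:
    """Phones owned by any record in the same SAME_AS cluster as start."""
    cluster = same_as_cluster(start, same_as)
    return sorted({p for person in cluster for p in owns.get(person, [])})
```

Asking for the phones of either person1 or person2 then returns both phone1 and phone2, while the two original records stay exactly as they were loaded. Changing the matching logic only means changing which SAME_AS links exist or how they are followed.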

Source: https://stackoverflow.com/questions/34120436/using-neo4j-to-build-a-master-data-management

Tags

neo4j

graph-databases

master-data-management