Constraint database


Question


I know the intuition behind constraint programming, but I have never really programmed with a constraint solver. I also think that achieving what we would define as consistent data is a somewhat different kind of problem.

Context:

We have a set of rules to implement on an ETL server. These rules:

  • act on one row;
  • act across rows, in one or several tables;
  • act the same way between two runs (maintaining the same constraint on all data, or just on the last n runs).

The third case differs from the second: it holds whenever the second holds, but over a well-defined number of runs. It may apply to a single run (one file), to the previous 1 to n runs, or to all files.

Technically, as we conceived the ETL, it has no memory between two runs, i.e. two files (but this is to be rethought).

To apply the third kind of rule, the ETL needs memory (I think we would end up backing up data inside the ETL), or it would have to re-check the whole database repeatedly (as a job) after some time window. In that case, data arriving in the database would not necessarily satisfy the third kind of rule at the time it lands.

Example:

While data flows in continuously, we apply constraints so that the whole database stays constrained. The next day we may receive a backup or a correction covering, say, one month; for that time window we would like the constraints satisfied for that run only, without worrying about the whole database. For future runs, all data should be constrained as before, without worrying about past data. You can imagine other rules that would fit temporal logic.

For now, only the first kind of rule is implemented. The way I thought of it is to have a minified database (of any kind: MySQL, PostgreSQL, MongoDB, ...) that backs up all data (only the constrained columns, probably with hashed values) together with flags recording consistency with respect to each kind of rule, as sketched below.
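A rough sketch of that idea, assuming a SQLite store; the table layout, column names, and the use of SHA-256 are my own illustrative choices, not a fixed design:

import hashlib
import sqlite3
import pandas as pd

def backup_constrained(df: pd.DataFrame, columns: list, run_id: str,
                       db_path: str = "etl_state.db") -> None:
    """Persist hashed values of the constrained columns, with a per-row
    consistency flag, so later runs can check inter-run rules."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS constrained
                    (run_id TEXT, column TEXT, value_hash TEXT, consistent INTEGER)""")
    # Default to "consistent" if the flag column is absent (hypothetical flag name).
    flags = df.get("consistent", pd.Series(True, index=df.index))
    for col in columns:
        for value, ok in zip(df[col], flags):
            value_hash = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            conn.execute("INSERT INTO constrained VALUES (?, ?, ?, ?)",
                         (run_id, col, value_hash, int(ok)))
    conn.commit()
    conn.close()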

Question: are there any solutions or design alternatives that would ease this process?

To illustrate in a made-up programming language, here is an example of a set of rules and the actions that follow:

run1 : WHEN tableA.ID == tableB.ID AND tableA.column1 > tableB.column2
       BACK-UP 
       FLAG tableA.rule1
AFTER run1 : LOG ('WARN')

run2 : WHEN tableA.column1 > 0
       DO NOT BACK-UP 
       FLAG tableA.rule2
AFTER run2 : LOG ('ERROR')
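A rough Python/pandas translation of those two rules; the table layout, the back-up list, and the flag columns are hypothetical:

import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)

def run_rules(tableA: pd.DataFrame, tableB: pd.DataFrame, backup: list) -> pd.DataFrame:
    # Rule 1: inter-table condition on matching IDs; back up, flag, then WARN.
    merged = tableA.merge(tableB, on="ID")
    rule1_ids = merged.loc[merged["column1"] > merged["column2"], "ID"]
    tableA["rule1"] = tableA["ID"].isin(rule1_ids)
    backup.append(tableA[tableA["rule1"]].copy())   # BACK-UP the flagged rows
    if tableA["rule1"].any():
        logging.warning("rule1 fired for IDs %s", sorted(rule1_ids))

    # Rule 2: single-row condition; flag only (no back-up), then ERROR.
    tableA["rule2"] = tableA["column1"] > 0
    if tableA["rule2"].any():
        logging.error("rule2 fired for %d rows", int(tableA["rule2"].sum()))
    return tableA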

Note: while constraint programming is, in theory, a paradigm for solving combinatorial problems, and in practice can speed up problem development and execution, I think this is different from a constraint-solving problem. The primary purpose here is not to optimize constraints before resolution, probably not even to limit data domains; the main concern is to apply rules on data reception and execute some basic actions (reject a line, accept a line, log, ...).

I really hope this is not too broad a question and that this is the right place to ask.


Answer 1:


I found a sophisticated solution that achieves more than I expected with regard to checking data consistency. Apparently this is what we would call test-driven data analysis (TDDA).

With this implementation we are bound to Python and Pandas, but fortunately not only those: we can also check data consistency in MySQL, PostgreSQL, ... tables.

The plus I had not thought about is that we can infer rules from sample data, which can be helpful for setting the rules in the first place. This is why there are both tdda.constraints.discover_df and tdda.constraints.verify_df.
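A minimal sketch of that discover/verify round-trip, closely following the tdda documentation; the column values and file name are illustrative:

import pandas as pd
from tdda.constraints import discover_df, verify_df

# Discover constraints from a sample DataFrame and save them as JSON.
df = pd.DataFrame({'a': [1, 2, 9], 'b': ['one', 'two', None]})
constraints = discover_df(df)
with open('example_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

# Later (e.g. on the next run), verify a dataset against the saved file.
verification = verify_df(df, 'example_constraints.tdda')
print(str(verification))        # summary of passing/failing constraints
print(verification.to_frame())  # per-constraint detail as a DataFrame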

As far as I have read, it does not propose a solution for checking a (weaker) consistency over the last n files, something I thought of calling batch-file consistency: a rule is satisfied only for some set of runs (the last n) rather than for all data. tdda acts on single files, so it needs higher-level wiring to handle n files that arrive successively; one possible shape for that wiring is sketched below.
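A hypothetical sketch of such higher-level wiring, re-verifying previously discovered constraints over the union of the last n runs; the BatchWindow class is my own assumption, not part of tdda:

from collections import deque
import pandas as pd
from tdda.constraints import verify_df

class BatchWindow:
    """Keep the last n runs and verify the saved constraints over their union."""
    def __init__(self, n: int, constraints_path: str):
        self.frames = deque(maxlen=n)          # oldest runs fall out automatically
        self.constraints_path = constraints_path

    def add_run(self, df: pd.DataFrame):
        self.frames.append(df)
        window = pd.concat(self.frames, ignore_index=True)
        return verify_df(window, self.constraints_path)  # check over the last n runs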

For more: https://tdda.readthedocs.io/en/latest/constraints.html#module-tdda.constraints

assertCSVFilesCorrect checks a set of files in a directory; the same is possible for Pandas DataFrames, and so on.

From the official documentation:

The tdda.constraints library is used to discover constraints from a (Pandas) DataFrame, write them out as JSON, and to verify that datasets meet the constraints in the constraints file. It also supports tables in a variety of relational databases. There is also a command-line utility for discovering and verifying constraints, and detecting failing records.

P.S. I am still open to other solutions; let me know, as I imagine this is a use case for any ETL solution.

I have also opened a bounty to further enrich the responses.




Answer 2:


You can also look into SQL transactions. A transaction consists of one or more statements whose execution is requested by a single user or application; they can read or even modify data in a database.

START TRANSACTION;
-- do DB work; check whether any constraint is violated
COMMIT;   -- or ROLLBACK if a constraint was violated

You can specify certain constraints and use ROLLBACK if one of them is violated. The rollback can be explicitly coded by the developer, but it can also be issued by the system (e.g. when an error occurs that the developer does not handle explicitly, or when a trigger executes). Transactions must not get in each other's way: they have to be executed in an "isolated" manner, meaning that several concurrent transactions must produce the same results in the data as the same transactions executed sequentially, in some (unspecified) order. Since all modern DBMSs guarantee the ACID properties for transactions, their execution is reliable, so the state of your database should not contain any inconsistencies.
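A minimal, self-contained sketch of this rollback behavior using Python's built-in sqlite3 module; the table and CHECK constraint are illustrative:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableA (ID INTEGER PRIMARY KEY,"
             " column1 INTEGER CHECK (column1 > 0))")

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("INSERT INTO tableA VALUES (1, 5)")
        conn.execute("INSERT INTO tableA VALUES (2, -3)")  # violates the CHECK constraint
except sqlite3.IntegrityError as exc:
    print("rolled back:", exc)

# Neither row was kept: the whole transaction was rolled back.
print(conn.execute("SELECT COUNT(*) FROM tableA").fetchone())  # (0,)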

Not sure if this is what you mean, but maybe it helps.



Source: https://stackoverflow.com/questions/58199623/constraint-database
