I have recently started learning about PyTables and found it very interesting. My question is:
- What are the basic advantages of PyTables over database(s) when it comes to huge datasets?
- What is the basic purpose of this package (I can do the same sort of structuring in NumPy and Pandas, so what's the big deal with PyTables)?
- Is it really helpful in the analysis of big datasets? Can anyone elaborate with an example and comparisons?
Thank you all.
> What are the basic advantages of PyTables over database(s) when it comes to huge datasets?
Effectively, it is a database. Of course it's a hierarchical database rather than a 1-level key-value database like dbm (which is obviously much less flexible) or a relational database like sqlite3 (which is more powerful, but more complicated).
But the main advantage over a non-numerics-specific database is exactly the same as the advantage of, say, a numpy ndarray over a plain Python list: it's optimized for performing lots of vectorized numeric operations, so if that's what you're doing with it, it's going to take less time and space.
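To make that concrete, here is a minimal sketch (the file name, record layout, and sizes are all invented for illustration) of storing numeric records in a PyTables table and running an in-kernel query, which is evaluated in compiled code via numexpr rather than in a Python loop:

```python
import numpy as np
import tables

# Hypothetical record layout for sensor readings.
class Reading(tables.IsDescription):
    timestamp = tables.Float64Col()
    value = tables.Float64Col()

with tables.open_file("demo.h5", mode="w") as f:
    table = f.create_table("/", "readings", Reading, "Sensor readings")
    # Bulk-append a NumPy structured array: no per-row Python loop.
    data = np.zeros(1_000_000, dtype=[("timestamp", "f8"), ("value", "f8")])
    data["timestamp"] = np.arange(1_000_000, dtype="f8")
    data["value"] = np.random.default_rng(0).normal(size=1_000_000)
    table.append(data)
    table.flush()
    # In-kernel query: the condition runs in compiled (numexpr) code,
    # without materializing each row as a Python object.
    hits = table.read_where("value > 3.0")
    print(len(hits), "readings above 3 sigma")
```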
> What is the basic purpose of this package
Quoting from the first line of the front page (or, if you prefer, the first line of the FAQ):
> PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
There's also a page listing the MainFeatures, linked near the top of the front page.
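As a small illustration of what "hierarchical" buys you (all names below are made up): groups nest like directories, arrays and tables hang off them as leaves, and you can walk the tree much as you would a filesystem:

```python
import numpy as np
import tables

with tables.open_file("hier.h5", mode="w") as f:
    # Groups nest like directories; arrays and tables are the leaves.
    run1 = f.create_group("/", "run1", "First experiment run")
    f.create_array(run1, "raw", np.arange(10), "Raw samples")
    f.create_array(run1, "calibrated", np.arange(10) * 0.5, "Calibrated samples")
    # Walk the tree much like os.walk on a filesystem.
    for node in f.walk_nodes("/", classname="Array"):
        print(node._v_pathname, node.shape)
```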
> (I can do the same sort of structuring in NumPy and Pandas, so what's the big deal with PyTables)?
Really? You can handle 64GB of data in numpy or pandas on a machine with only 16GB of RAM? Or a 32-bit machine?
No, you can't. Unless you split your data up into a bunch of separate sets that you load, process, and save as needed—but that's going to be much more complicated, and much slower.
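Here is a sketch of that chunked pattern done the easy way, with PyTables doing the bookkeeping (sizes are shrunk so the example runs quickly; the point is that memory use stays flat no matter how many chunks you append):

```python
import numpy as np
import tables

CHUNK = 1000  # rows per chunk

with tables.open_file("big.h5", mode="w") as f:
    # EArray: extendable along the first (zero-length) dimension.
    arr = f.create_earray("/", "data", tables.Float64Atom(), shape=(0, CHUNK))
    rng = np.random.default_rng(0)
    for _ in range(10):  # make this 10_000 for roughly 80 GB of data
        arr.append(rng.normal(size=(CHUNK, CHUNK)))

with tables.open_file("big.h5", mode="r") as f:
    arr = f.root.data
    total = 0.0
    # Reduce chunk by chunk: only one slice is ever in memory at a time.
    for start in range(0, arr.nrows, CHUNK):
        total += arr[start:start + CHUNK].sum()
    print("grand total:", total)
```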
It's like asking why you need numpy when you can do the same thing with just regular Python lists and iterators. Pure Python is great when you have an array of 8 floats, but not when you have a 10000x10000 array of them. And numpy is great when you have a couple of 10000x10000 arrays, but not when you have a dozen interconnected arrays ranging up to 20GB in size.
> Is it really helpful in the analysis of big datasets?
Yes.
> Can anyone elaborate with an example…
Yes. Rather than copying all of the examples here, why don't you just look at the simple examples on the front page of the docs, the slew of examples in the source tree, the links to real-world use cases two clicks from the front page of the docs, etc.?
If you want to convince yourself of the usefulness of PyTables, take any of the examples and scale it up to 32GB worth of data, then try to figure out how you'd do the exact same thing in numpy or pandas.
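For instance, here is one way (names and sizes invented for illustration) to do whole-array arithmetic without ever loading the operands or the result into memory at once, using tables.Expr, which evaluates the expression blockwise through numexpr:

```python
import numpy as np
import tables

n = 1_000_000  # make this as large as your disk, not your RAM, allows

with tables.open_file("expr.h5", mode="w") as f:
    a = f.create_carray("/", "a", tables.Float64Atom(), shape=(n,))
    b = f.create_carray("/", "b", tables.Float64Atom(), shape=(n,))
    # Setup builds each operand in memory once for brevity; a real
    # pipeline would append to extendable arrays in chunks.
    a[:] = np.random.default_rng(1).normal(size=n)
    b[:] = np.random.default_rng(2).normal(size=n)
    out = f.create_carray("/", "out", tables.Float64Atom(), shape=(n,))
    # Evaluate 2*a + b blockwise on disk; neither the operands nor the
    # result need to fit in memory at the same time.
    expr = tables.Expr("2*a + b", uservars={"a": a, "b": b})
    expr.set_output(out)
    expr.eval()
    print(out[:5])
```

Doing the same thing in numpy requires all three arrays to be resident at once, which is exactly what stops working when the data outgrows your RAM.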
Source: https://stackoverflow.com/questions/16660617/what-is-the-advantage-of-pytables