I have a single large table which I would like to optimize. I'm using MS SQL Server 2005. I'll try to describe how it is used, and if anyone has any suggestions I would appreciate it.
Your query plan basically shows the following:
The plan suggests an index, which should improve performance by 81%: k1, k4, k5, k6, k3, with d1 and k7 as included columns. I don't know how long it would take to build such an index and see the results, but as I've commented here, it will effectively double the size of your table, simply because almost every column is present in the index. Inserts will also be slower.
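For reference, a hedged sketch of what that suggested index would look like (the table name dbo.BigTable is a placeholder; the key and included columns follow the suggestion above):

    -- Optimizer-suggested index (table name is a placeholder).
    -- k1, k4, k5, k6, k3 as key columns; d1 and k7 carried only at the leaf
    -- level via INCLUDE, which is why it nearly doubles the table's footprint.
    CREATE NONCLUSTERED INDEX IX_BigTable_k1_k4_k5_k6_k3
        ON dbo.BigTable (k1, k4, k5, k6, k3)
        INCLUDE (d1, k7);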
As many people have suggested, partitioning is the best strategy here: for example, have one table hold k3 values from 1 to 3, another from 4 to 7, and a third from 8 to 10. With SQL Server this can be done as a partitioned view, with a CHECK constraint on that column in each member table; the query optimizer will then determine which of the n tables to scan or seek, depending on the parameter value for the column.
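A hedged sketch of that layout, assuming three member tables with identical schemas already exist (all object names are placeholders):

    -- CHECK constraints tell the optimizer which k3 range each table can hold.
    ALTER TABLE dbo.BigTable_k3_01_03
        ADD CONSTRAINT CK_BigTable_k3_01_03 CHECK (k3 BETWEEN 1 AND 3);
    ALTER TABLE dbo.BigTable_k3_04_07
        ADD CONSTRAINT CK_BigTable_k3_04_07 CHECK (k3 BETWEEN 4 AND 7);
    ALTER TABLE dbo.BigTable_k3_08_10
        ADD CONSTRAINT CK_BigTable_k3_08_10 CHECK (k3 BETWEEN 8 AND 10);
    GO

    -- A UNION ALL view gives queries a single object to target; with the
    -- constraints in place, a predicate like WHERE k3 = 5 touches only
    -- the middle table.
    CREATE VIEW dbo.BigTable_ByK3
    AS
        SELECT * FROM dbo.BigTable_k3_01_03
        UNION ALL
        SELECT * FROM dbo.BigTable_k3_04_07
        UNION ALL
        SELECT * FROM dbo.BigTable_k3_08_10;
    GO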
As I hinted in a comment, I have done this with a single Oracle table approaching 8 TB, consisting of over two billion rows and growing at the rate of forty million rows per day. However, in my case, the users were two million (and growing) customers accessing this data over the web, 24x7, and literally ANY of the rows was subject to being accessed. Oh, and new rows had to be added within two minutes of real-time.
You are probably I/O bound, not CPU or memory bound, so optimizing the disk access is critical. Your RAM is fine--more than adequate. Using multiple cores would be helpful, but limited if the I/O is not parallelized.
Several people have suggested splitting up the data, which should be taken seriously since it is far better and more effective than any other solution (nothing is faster than not touching the data at all).
You say you can't split the data because all the data is used: IMPOSSIBLE! There is no way that your users are paging through one million rows per day or one hundred million rows total. So, get to know how your users are ACTUALLY using the data--look at every query in this case.
More importantly, we are not saying that you should DELETE the data, we are saying to SPLIT the data. Clone the table structure into multiple, similarly-named tables, probably based on time (one month per table, perhaps). Copy the data into the relevant tables and delete the original table. Create a view that performs a union over the new tables, with the same name as the original table. Change your insert processing to target the newest table (assuming that it is appropriate), and your queries should still work against the new view.
Your savvy users can now start to issue their queries against a subset of the tables, perhaps even the newest one only. Your unsavvy users can continue to use the view over all the tables.
You now have a data management strategy in the form of archiving the oldest table and deleting it (update the view definition, of course). Likewise, you will need to create a new table periodically and update the view definition for that end of the data as well.
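A hedged sketch of that pattern with monthly tables (all object names are placeholders; the original table is assumed to have been renamed beforehand so the view can take its old name):

    -- The view keeps the original table's name, so existing queries keep working.
    CREATE VIEW dbo.BigTable
    AS
        SELECT * FROM dbo.BigTable_2009_01
        UNION ALL
        SELECT * FROM dbo.BigTable_2009_02;
    GO

    -- Insert processing now targets the newest member table
    -- (dbo.BigTable_2009_02) directly rather than the view.

    -- Rolling the window forward: archive and drop the oldest table,
    -- create the next month's table, then repoint the view.
    ALTER VIEW dbo.BigTable
    AS
        SELECT * FROM dbo.BigTable_2009_02
        UNION ALL
        SELECT * FROM dbo.BigTable_2009_03;
    GO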
Expect to not be able to use unique indexes: they don't scale beyond about one to two million rows. You may have to modify some other tactics and advice as well. At one hundred million rows and 400 GB, you have entered another realm of processing.
Beyond that, use the other suggestions--analyze the actual performance using the many tools already available in SQL Server and the OS. Apply the many well-known tuning techniques that are readily available on the web or in books.
However, do NOT experiment! With that much data, you don't have time for experiments and the risk is too great. Study carefully the available techniques and your actual performance details, then choose one step at a time and give each one a few hours to days to reveal its impact.
First off, spend a day with SQL Profiler running in the background. At the end of the day, save the trace data to a file and have the tuning wizard (the Database Engine Tuning Advisor in SQL Server 2005) pore over it and evaluate your current indexes. That should tell you whether changing the indexed fields, sort order, etc. can give you any significant gains. Do not let the wizard make the changes. If the percentage performance gain looks significant (> 30% IMHO), go ahead and make the change yourself.
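If you save the trace to a file, you can also pull it into a table yourself for a quick look before handing it to the wizard; a minimal sketch, assuming the trace was written to C:\traces\day_trace.trc:

    -- Load the saved Profiler trace and find the heaviest statements by reads.
    -- (The file path is an assumption; DEFAULT reads all rollover files.)
    SELECT TOP 50 TextData, Duration, CPU, Reads, Writes, StartTime
    FROM sys.fn_trace_gettable('C:\traces\day_trace.trc', DEFAULT)
    WHERE TextData IS NOT NULL
    ORDER BY Reads DESC;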
Your index has to be getting fairly large. You may want to schedule a job (overnight, a couple of times a week) to defragment the indexes and refresh the statistics.
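A hedged sketch of such a job (object names are placeholders; on a 400 GB table, REORGANIZE is the lighter-weight option):

    -- Nightly/weekly maintenance: defragment the indexes and refresh statistics.
    ALTER INDEX ALL ON dbo.BigTable REORGANIZE;      -- online, can be interrupted
    UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;    -- or sp_updatestats for the whole DB

    -- A full rebuild removes fragmentation completely, but it takes the index
    -- offline in Standard Edition and needs a long maintenance window here:
    -- ALTER INDEX ALL ON dbo.BigTable REBUILD;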
That will keep it speedy once you have tuned the indexes.
I think a clustered index on k7 is the only thing of any value. The rest of your WHERE clause has such low selectivity that it's a waste of time.
Unless you can take advantage of some specific knowledge of your values (maybe k5 is only true if k4 < 0, or something), you're pretty much looking at a clustered index scan. You might as well make the clustering key the field that you're ordering by.
Looking at the low numbers of distinct values in k3 - k6, you'd probably only need to read < 1.5 million rows to get your top 1 million. That's probably the best you're going to do - especially since any other plan would need you to order by k7 anyway to evaluate your TOP clause.
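A hedged sketch of that clustered index (the table name is a placeholder; building or changing a clustered index on 400 GB is itself a long operation, so schedule it accordingly):

    -- Cluster on the ORDER BY column so a TOP query can read rows
    -- in k7 order and stop early instead of sorting the whole set.
    CREATE CLUSTERED INDEX CIX_BigTable_k7
        ON dbo.BigTable (k7);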
It looks like you only want the earliest "g" records? Maybe only the most recent "g" records?
Basically you want your query to read only the most recent or oldest records. You don't want to query the entire 400 GB, do you? If that's the case, you might consider archiving the majority of the 400 GB, or keeping the most recently inserted records in a "current" table that you can query. You can keep the records in the current table up to date through dual inserts, or through a trigger on the table (shudder). But the basic premise is that you run your query against as small a table as possible. This is basically poor man's table partitioning.
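A hedged sketch of the trigger variant (table and column names are placeholders; a separate scheduled job would trim old rows out of the current table):

    -- Copy every new row into a small "current" table that queries can hit
    -- instead of the full 400 GB table.
    CREATE TRIGGER trg_BigTable_ToCurrent
    ON dbo.BigTable
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO dbo.BigTable_Current (k1, k3, k4, k5, k6, k7, d1)
        SELECT k1, k3, k4, k5, k6, k7, d1
        FROM inserted;
    END;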
Have you considered creating a surrogate identity column (type bigint) and using that as the clustered index? Then create your primary key as a non-clustered unique index.
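A hedged sketch of that change (names are placeholders and the key columns in the new primary key are purely illustrative; the existing clustered primary key would have to be dropped first, which rewrites the table):

    -- Add a surrogate bigint identity and make it the clustering key;
    -- re-create the business key as a nonclustered primary key.
    ALTER TABLE dbo.BigTable
        ADD Id bigint IDENTITY(1, 1) NOT NULL;

    CREATE UNIQUE CLUSTERED INDEX CIX_BigTable_Id
        ON dbo.BigTable (Id);

    ALTER TABLE dbo.BigTable
        ADD CONSTRAINT PK_BigTable
        PRIMARY KEY NONCLUSTERED (k1, k7);   -- illustrative key columns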
With a table of this size, it's quite possible that index and page fragmentation are a big performance problem. The surrogate clustered index will ensure that all inserts are at the end of the table, which can almost completely eliminate page fragmentation (unless rows get deleted). Less page fragmentation == more pages per IO, which is a very good thing.
This will also allow you to periodically defrag the unique index that you are querying on, which will make it much more effective. Do this often, or at least monitor index fragmentation on this table regularly.
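A hedged sketch of that monitoring (database and table names are placeholders; sys.dm_db_index_physical_stats is available from SQL Server 2005 on):

    -- Report fragmentation for every index on the table.
    SELECT i.name AS index_name,
           ps.avg_fragmentation_in_percent,
           ps.page_count
    FROM sys.dm_db_index_physical_stats(
             DB_ID('MyDatabase'), OBJECT_ID('dbo.BigTable'),
             NULL, NULL, 'LIMITED') AS ps
    JOIN sys.indexes AS i
        ON i.object_id = ps.object_id
       AND i.index_id  = ps.index_id;

    -- Rough rule of thumb: REORGANIZE at moderate fragmentation (about 5-30%),
    -- REBUILD when it is worse than that.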
These performance improvements can be quite dramatic -- if your current PK is highly fragmented, an index seek can involve a great deal more IO than it should.
Once you've implemented this, consider (aka, try it and measure ;-) adding a nonclustered index on column k7.