Why is SELECT *
bad practice? Wouldn\'t it mean less code to change if you added a new column you wanted?
I understand that SELECT COUNT(*)
Think of it as reducing the coupling between the app and the database.
To summarize the 'code smell' aspect:
SELECT *
creates a dynamic dependency between the app and the schema. Restricting its use is one way of making the dependency more defined, otherwise a change to the database has a greater likelihood of crashing your application.
If you add fields to the table, they will automatically be included in all your queries where you use select *
. This may seem convenient, but it will make your application slower as you are fetching more data than you need, and it will actually crash your application at some point.
There is a limit for how much data you can fetch in each row of a result. If you add fields to your tables so that a result ends up being over that limit, you get an error message when you try to run the query.
This is the kind of errors that are hard to find. You make a change in one place, and it blows up in some other place that doesn't actually use the new data at all. It may even be a less frequently used query so that it takes a while before someone uses it, which makes it even harder to connect the error to the change.
If you specify which fields you want in the result, you are safe from this kind of overhead overflow.
Even if you wanted to select every column now, you might not want to select every column after someone adds one or more new columns. If you write the query with SELECT *
you are taking the risk that at some point someone might add a column of text which makes your query run more slowly even though you don't actually need that column.
Wouldn't it mean less code to change if you added a new column you wanted?
The chances are that if you actually want to use the new column then you will have to make quite a lot other changes to your code anyway. You're only saving , new_column
- just a few characters of typing.
Using SELECT *
when you only need a couple of columns means a lot more data transferred than you need. This adds processing on the database, and increase latency on getting the data to the client. Add on to this that it will use more memory when loaded, in some cases significantly more, such as large BLOB files, it's mostly about efficiency.
In addition to this, however, it's easier to see when looking at the query what columns are being loaded, without having to look up what's in the table.
Yes, if you do add an extra column, it would be faster, but in most cases, you'd want/need to change your code using the query to accept the new columns anyways, and there's the potential that getting ones you don't want/expect can cause issues. For example, if you grab all the columns, then rely on the order in a loop to assign variables, then adding one in, or if the column orders change (seen it happen when restoring from a backup) it can throw everything off.
This is also the same sort of reasoning why if you're doing an INSERT
you should always specify the columns.
The asterisk character, "*", in the SELECT statement is shorthand for all the columns in the table(s) involved in the query.
The *
shorthand can be slower because:
SELECT *
over the wire risks a full table scanWhen using SELECT *
:
SELECT *
will hide an error waiting to happen if a table had its column order changed.SELECT *
is an anti-pattern:
It's acceptable to use SELECT *
when there's the explicit need for every column in the table(s) involved, as opposed to every column that existed when the query was written. The database will internally expand the * into the complete list of columns - there's no performance difference.
Otherwise, explicitly list every column that is to be used in the query - preferably while using a table alias.
There is also more pragmatic reason: money. When you use cloud database and you have to pay for data processed there is no explanation to read data that you will immediately discard.
For example: BigQuery:
Query pricing
Query pricing refers to the cost of running your SQL commands and user-defined functions. BigQuery charges for queries by using one metric: the number of bytes processed.
and Control projection - Avoid SELECT *:
Best practice: Control projection - Query only the columns that you need.
Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results).
Using SELECT * is the most expensive way to query data. When you use SELECT *, BigQuery does a full scan of every column in the table.