问题
Postgres version 9.4.18, PostGIS Version 2.2.
Here are the tables I'm working with (and can unlikely make significant changes to the table structure):
Table ltg_data
(spans 1988 to 2018):
Column | Type | Modifiers
----------+--------------------------+-----------
intensity | integer | not null
time | timestamp with time zone | not null
lon | numeric(9,6) | not null
lat | numeric(8,6) | not null
ltg_geom | geometry(Point,4269) |
Indexes:
"ltg_data2_ltg_geom_idx" gist (ltg_geom)
"ltg_data2_time_idx" btree ("time")
Size of ltg_data
(~800M rows):
ltg=# select pg_relation_size('ltg_data');
pg_relation_size
------------------
149729288192
Table counties
:
Column | Type | Modifiers
-----------+-----------------------------+--------------------------------- -----------------------
gid | integer | not null default nextval('counties_gid_seq'::regclass)
objectid_1 | integer |
objectid | integer |
state | character varying(2) |
cwa | character varying(9) |
countyname | character varying(24) |
fips | character varying(5) |
time_zone | character varying(2) |
fe_area | character varying(2) |
lon | double precision |
lat | double precision |
the_geom | geometry(MultiPolygon,4269) |
Indexes:
"counties_pkey" PRIMARY KEY, btree (gid)
"counties_gix" gist (the_geom)
"county_cwa_idx" btree (cwa)
"countyname_cwa_idx" btree (countyname)
Desired result:
I want a time series with one row for every day of the year in format 'MM-DD' ignoring the year: 01-01, 01-02, 01-03, ..., 12-31. And the count of rows in table ltg_data
for each day of the year. I also eventually want the same thing for every hour of every day of the year ('MM-DD-HH').
A group by
statement should accomplish this, but I'm having a hard time joining the "big" table with the days generated with generate_series()
.
MM-DD | total_count
-------+------------
12-22 | 9
12-23 | 0
12-24 | 0
12-25 | 0
12-26 | 23
12-27 | 0
12-28 | 5
12-29 | 0
12-30 | 0
12-31 | 0
Some of my many attempted queries:
SELECT date_trunc('day', d),
count(a.lat) AS strikes
FROM generate_series('2017-01-01', '2018-12-31', interval '1 day') AS d
LEFT JOIN
(SELECT date_trunc('day', TIME) AS day_of_year,
ltg_data.lat
FROM ltg_data
JOIN counties ON ST_contains(counties.the_geom, ltg_data.ltg_geom)
WHERE cwa = 'MFR' ) AS a ON d = day_of_year
GROUP BY d
ORDER BY d ASC;
But this doesn't ignore year. I shouldn't be surprised because the "day" in date_trunc is still considering the year I guess.
2017-12-27 00:00:00-08 | 0
2017-12-28 00:00:00-08 | 0
2017-12-29 00:00:00-08 | 0
2017-12-30 00:00:00-08 | 0
2017-12-31 00:00:00-08 | 0
2018-01-01 00:00:00-08 | 0
2018-01-02 00:00:00-08 | 12
2018-01-03 00:00:00-08 | 0
And this query, in which I'm trying to convert the data from generate_series()
to text
in 'DD-MM' format to join to the ltg_data
table in text
format. Says the data types don't match. I've tried extract
as well, since that could provide "doy" and "hour", which would work, but I can't seem to match data types in that query either. It's hard to make that "generate_series" a double precision.
SELECT to_char(d, 'MM-DD') AS DAY,
count(a.lat) AS strikes
FROM
(SELECT generate_series('2017-01-01', '2018-12-31', interval '1 day') AS d)
AS f
LEFT JOIN
(SELECT to_char(TIME, 'MM-DD') AS day_of_year,
ltg_data.lat
FROM ltg_data
JOIN counties ON ST_contains(counties.the_geom, ltg_data.ltg_geom)
WHERE cwa = 'MFR' ) AS a ON f = day_of_year
GROUP BY d
ORDER BY d ASC;
Result:
ERROR: operator does not exist: record = text
LINE 4: ON f = day_of_year group by d order by d asc;
^
HINT: No operator matches the given name and argument type(s). You might
need to add explicit type casts.
Conclusion: I'm aiming at getting daily and hourly total counts that span many years but group by 'MM-DD' and 'MM-DD-HH' (ignoring year), with the query results showing all days/hours even if they are zero.
Later I'll also try to find averages and percentiles for days and hours, so if you have any advice on that, I'm all ears. But my current problem is focused on just getting a complete result for totals.
回答1:
Basically, to cut off the year, to_char(time, 'MMDD')
like you already tried does the job. You just forgot to also apply it to the timestamps generated with generate_series()
before joining. And some other minor details.
To simplify and for performance and convenience I suggest this simple function to calculate an integer
from the pattern 'MMDD' of a given timestamp
.
CREATE FUNCTION f_mmdd(date) RETURNS int LANGUAGE sql IMMUTABLE AS
'SELECT (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int';
I used to_char(time, 'MMDD')
at first, but switched to the above expression that turned out to be fastest in various tests.
db<>fiddle here
It can be used in expression indexes since it's defined IMMUTABLE
. And it still allows function inlining because it only uses EXTRACT (xyz FROM date)
- which is implemented with the IMMUTABLE
function date_part(text, date)
internally. (Note that datepart(text, timestamptz)
is only STABLE
).
Then this kind of query does the job:
SELECT d.mmdd, COALESCE(ct.ct, 0) AS total_count
FROM (
SELECT f_mmdd(d::date) AS mmdd -- ignoring the year
FROM generate_series(timestamp '2018-01-01' -- any dummy year
, timestamp '2018-12-31'
, interval '1 day') d
) d
LEFT JOIN (
SELECT f_mmdd(time::date) AS mmdd, count(*) AS ct
FROM counties c
JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY 1
) ct USING (mmdd)
ORDER BY 1;
Since time
(I would use a different column name) is data type timestamptz
the cast time::date
depends on the time zone setting of your current session. ("Days" are defined by the time zone you are in.) To get immutable (but slower) results use the AT TIME ZONE
construct with a time zone name like:
SELECT f_mmdd((time AT TIME ZONE 'Europe/Vienna')::date) ...
Details:
- Ignoring time zones altogether in Rails and PostgreSQL
Format mmdd
any way you like for display.
The cast to integer
is optional for the purpose of this particular query. But since you plan to do all kinds of queries, you'll end up wanting an index on the expression:
CREATE INDEX ltg_data_mmdd_idx ON event(f_mmdd(time));
(Not needed for this query.)integer
is a bit faster for this purpose.
And you need the (otherwise optional) function wrapper for this since The updated expression to_char()
is only defined STABLE
, but we need IMMUTABLE
for the index.(EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int
is IMMUTABLE
, but the function wrapper is still convenient.
Related:
- How do you do date math that ignores the year?
- Generating time series between two dates in PostgreSQL
来源:https://stackoverflow.com/questions/50374136/how-to-aggregate-data-from-multiple-years-on-mm-dd-ignoring-year