The database

Accessed either via the SQL Manager program on the laptop, the psql terminal command (via the psconnect alias), or with the psycopg2 (psycopg2-binary) python library.

Have created an experimental table called public.coup_tall_april containing data from April 2017 in a “tall” format, using code from Jason:


CREATE TABLE public.coup_tall_april AS
SELECT  a.icp_id
     , a.read_date
     , c.period
     , sum(c.read_kwh) as kwh_tot
     , sum(case when a.content_code = 'UN' then c.read_kwh else 0 end) as kwh_un
     , sum(case when a.content_code in ('CN','EG') then c.read_kwh else 0 end) as kwh_cn
FROM    coup_prd.coupdatamaster a,
	unnest(a.read_array) WITH ORDINALITY c(read_kwh, period)
WHERE   a.read_date >= to_date('01/04/2017','dd/mm/yyyy')
 and   a.read_date <  to_date('01/05/2017','dd/mm/yyyy')
 and   a.content_code  ~ ('UN|CN|EG')
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;

This data looks like:


SELECT * FROM public.coup_tall_april limit 10;

icp_id | read_date | period | kwh_tot | kwh_un | kwh_cn ---------+------------+--------+---------+--------+-------- I000002 | 2017-04-01 | 1 | 0.123 | 0.123 | 0.0 I000002 | 2017-04-01 | 2 | 0.161 | 0.161 | 0.0 I000002 | 2017-04-01 | 3 | 0.118 | 0.118 | 0.0 I000002 | 2017-04-01 | 4 | 0.108 | 0.108 | 0.0 I000002 | 2017-04-01 | 5 | 0.125 | 0.125 | 0.0 I000002 | 2017-04-01 | 6 | 0.144 | 0.144 | 0.0 I000002 | 2017-04-01 | 7 | 0.11 | 0.11 | 0.0 I000002 | 2017-04-01 | 8 | 0.116 | 0.116 | 0.0 I000002 | 2017-04-01 | 9 | 0.197 | 0.197 | 0.0 I000002 | 2017-04-01 | 10 | 0.144 | 0.144 | 0.0

icp_id is the ID of the ICP, which may be a home or business. This is a varchar(10), although it appears to only have 7 characters. The ID is not the real ID, but an anonymised value.
read_date is the date, in this case in April 2017.
kwh_cn is the demand in kwh that the company has some level of control over, e.g. by systems that turn off and on water heaters remotely. This ought to be relatively stable, although in many cases this will be 0.
kwh_un is the uncontrolled demand, i.e. the rest.
kwh_tot is the sum of the other kwh measurements for this half-hour

Statistics

This dataset includes 34278 distinct ICPs: SELECT COUNT(DISTINCT icp_id) FROM public.coup_tall_april;

Not every day has the same number of ICPs recorded for it:

(Some are added and removed each day; some are disconnected for various reasons on different days


SELECT read_date, COUNT(DISTINCT icp_id) AS d_icp 
FROM public.coup_tall_april 
GROUP BY read_date;

Days have similar averages (within the same month), but sometimes values are negative:

Negative values possibly from solar power injecting back into the network, but is so rare it can be ignored.


SELECT read_date, min(kwh_tot), Avg(kwh_tot), max(kwh_tot) 
FROM public.coup_tall_april 
GROUP BY read_date;

Three values in this table are negative:


SELECT * FROM public.coup_tall_april WHERE kwh_tot < 0 OR kwh_un < 0 OR kwh_cn < 0;

icp_id | read_date | period | kwh_tot | kwh_un | kwh_cn ---------+------------+--------+---------+---------+-------- I017181 | 2017-04-12 | 19 | -30.530 | -30.585 | 0.055 I019141 | 2017-04-29 | 37 | -31.445 | -31.445 | 0 I019141 | 2017-04-29 | 38 | -31.624 | -31.624 | 0

There are 334 values in this table where the icp_id ends in 17:


SELECT COUNT (DISTINCT icp_id) FROM public.coup_tall_april WHERE icp_id LIKE '%17';

SELECT DISTINCT icp_id FROM public.coup_tall_april WHERE icp_id LIKE '%17' ORDER BY icp_id LIMIT 10;

This could be useful for sampling later.

Research

Roelofsen, Time Series Clustering

A master’s thesis from the Netherlands, which gives a lot of useful information. (link.)

Notes in notebook; copy next week. Up to page 54.

Alonso, Time Series Clustering

A presentation, which I still need to read. (link)

Twoards Data Science, Playing with time series data in python

(link.)

Minnaar, Time Series Classification and Clustering with Python

http://alexminnaar.com/time-series-classification-and-clustering-with-python.html

notes.md 4.4KB Geschiedenis Ruw