
ampli

Repository for Petra Lamborn’s internship at ampli Jan-Feb 2019

This repository serves as both a reference for the folk at ampli to build upon what I produced for them during this internship, and as a CV item for me. Non-ampli people will not be able to use more than a handful of the items below directly, but it still demonstrates what I can do.

The purpose of this project was to cluster the electricity demand time series of a large number of ICPs (Installation Control Points), and then project the patterns for these clusters into the future.

What isn’t in this repository

This repository does not contain the downloaded data (for confidentiality and size reasons), the fitted models (for similar reasons), or the configuration file for downloading from the database (which contains a password). Recreate empty data/ and models/ directories (see the example after the configuration file below). The configuration file is at py/database.ini and looks like:

[postgresql]
host=<Hostname>
database=<Database>
user=<Username>
password=<Password>

This file is based on an example from the PostgreSQL website. Replace <Hostname>, <Database>, <Username> and <Password> with valid credentials.
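For example, from the repository root, the missing directories can be recreated with:

mkdir data models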

SQL

I’ve included a number of annotated SQL queries in sql/queries.pgsql and sql/weather.pgsql. The latter file is specifically about the weather dataset; the former is more general. Some of the queries are duplicated in py/util.py and related scripts, so editing these two files does nothing on its own.

The queries.pgsql file includes the commands used to create the tables available in the --table option of the downkwh.py script.
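The scripts never execute these files themselves, but they are plain SQL, so if you want to (re)create those tables you can run them directly with psql; the host, database and user below are the same placeholders as in the configuration file above:

psql -h <Hostname> -d <Database> -U <Username> -f sql/queries.pgsql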

Python

requirements.txt lists the Python packages required: set up a virtual environment with virtualenv -p /usr/bin/python3 venv and install them with pip install -r requirements.txt. Notably these are:

  • numpy
  • pandas
  • pkg-resources
  • psycopg2-binary
  • pyparsing
  • python-dateutil
  • pytz
  • scipy
  • seaborn
  • statsmodels
  • tabview

The virtual environment is activated with source venv/bin/activate. The Python scripts are in the py/ folder. The scripts designed to be called directly are run with python <scriptname.py>; use python <scriptname.py> -h to view help. Note that most options have a default, which may not be what you want, so always check.
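Putting that together, a first-time setup from the repository root looks like:

virtualenv -p /usr/bin/python3 venv
source venv/bin/activate
pip install -r requirements.txt
cd py
python downkwh.py -h   # view help for one of the scripts described below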

util.py

This script is imported by several other scripts, particularly for downloading the data from the database.

downkwh.py

Downloads demand data from the database. Options:

  • -o PATH: The path for the python “pickle” file to store the result in.
  • -s DATE: The start date for the download in YYYY-MM-DD format; default of 2017-01-01.
  • -e DATE: The end date in YYYY-MM-DD format; default of 2018-01-01.
  • -t TABLE: The table in the database from which to obtain the wanted ICP ids; default is public.icp_sample, a table which contains 1000 ICPs with good data for 2017. Important: Don’t assume that SQL injection can’t come through this vector, although I have constrained the values that this script will accept from the command line to the following list:
    • public.best_icp: all ICPs with at least 360 days of data in 2017
    • public.best_icp_1618: all ICPs with at least 720 days of data in the 2 years from 1 April 2016
    • public.best_icp_18m: all ICPs with at least 540 days of data from July 2016 to the end of 2017
    • public.icp_sample: a pre-generated 1k sample from best_icp
    • public.icp_sample_5k: a pre-generated 5k sample from best_icp
    • public.icp_sample_1618: a pre-generated 1k sample from best_icp_1618
    • public.icp_sample_18m: a pre-generated 1k sample from best_icp_18m
  • -n NUM: The algorithm downloads the dataset in pieces, optimises each piece to reduce storage space, and reassembles them. This option defines the number of such pieces; it should always be less than the number of days between the start and end dates. Default of 12.
  • --no-pivot: This option can probably be ignored, as it downloads the dataset in a less efficient non-“pivoted” form, which was used in the original versions of some of these scripts.
  • -v: Output some extra progress information as it goes; mostly useful for debugging.

Example:

python downkwh.py -o ../data/test1k.pkl -n 24

Downloads data from the default period into ../data/test1k.pkl with 24 segments used.

downweather.py

Downloads weather (temperature and humidity) data from the database, from one specified station.

  • -o PATH: The path for the python “pickle” file to store the result in.
  • -s DATE: The start date for the download in YYYY-MM-DD format; default of 2016-04-01.
  • -e DATE: The end date in YYYY-MM-DD format; default of 2019-01-01.
  • --station: The station to download from; default is 2006, which is located near Pukekohe.
  • -v: Output some extra progress information as it goes; mostly useful for debugging.

Example:

python downweather.py -o ../data/weathertest.pkl

Downloads data from the default period into ../data/weathertest.pkl.

dcorr.py

Creates a distance correlation matrix from a pickled pandas dataframe. Note that this takes some time, but has a progress bar.

  • -i PATH: The path for the python “pickle” file to retrieve the original data from.
  • -o PATH: The path for the python “pickle” file to store the result in.
  • -p: If the dataframe is not in wide form, pivots it into wide form first. Note: untested and unused; represents past behaviour.

Example:

python dcorr.py -i ../data/test1k.pkl -o ../data/test1kdcorr.pkl

Calculates a distance correlation matrix from the dataset previously downloaded to ../data/test1k.pkl and stores it in ../data/test1kdcorr.pkl.

clustering.py

Algorithm for hierarchical clustering. Note: different datasets and parameters may produce similar clusters in a different order.

  • -i PATH: The path for the python “pickle” file to retrieve the original data from.
  • -o PATH: The path for the python “pickle” file to store the result in.
  • --method METHOD: Clustering method for hierarchical clustering; default is ‘ward’. See the scipy docs for more information and other options.
  • --clusters NUM: The number of clusters; default is 9.
  • -d: Include dendrogram; requires -t/--tree.
  • -t PATH: Path to save the dendrogram image to when -d is given.
  • -v: Output some extra progress information as it goes; mostly useful for debugging.

Example:

python clustering.py -i ../data/test1kdcorr.pkl -o ../data/test1kclustable.pkl -d -t ../img/test1kdendro.png

Clusters data from ../data/test1kdcorr.pkl; saves the table of clusters/ICPs at ../data/test1kclustable.pkl and the dendrogram at ../img/test1kdendro.png.

agg.py

Aggregates data based on clusters. Note: the columns CI_low and CI_high do not represent formal statistical confidence intervals; they are merely lower and upper quartiles, for indicative purposes only.

  • -i PATH: The path for the python “pickle” file to retrieve the original data from (i.e. the data downloaded with downkwh.py).
  • -c PATH: The path for the python “pickle” file to retrieve the cluster data from.
  • -o PATH: The path for the python “pickle” file to store the result in.
  • -d: Drop cluster ‘-1’, which represents the miscellaneous/unclustered pseudo-cluster produced by clusAssign.py.
  • -p: If the dataframe is not in wide form, pivots it into wide form first. Note: untested and unused; represents past behaviour.

Example:

python agg.py -i ../data/test1k.pkl -c ../data/test1kclustable.pkl -o ../data/test1kagg.pkl

Aggregates data from ../data/test1k.pkl using the cluster information in ../data/test1kclustable.pkl and saves the result in ../data/test1kagg.pkl.

pickletocsv.py

Helper script to transform a pickle into a csv file, for easier importing into e.g. Excel.

  • input: The path for the python “pickle” file which contains the dataset.
  • output: The path for the csv file to store the dataset in; if omitted, or -, prints to stdout.
  • -r: Include row names/index labels in the csv. This may be essential for proper exporting of some datasets.
  • -v: Output extra information, including dimensions of dataset.

Examples:

python pickletocsv.py ../data/test1kagg.pkl | less

Reads file at ../data/test1kagg.pkl and views it in the UNIX pager less.

python pickletocsv.py ../data/test1kagg.pkl - | tabview -

Reads the same file and views it using the tabview python module (included in requirements.txt). The two - arguments are shorthand for stdout (for pickletocsv.py) and stdin (for tabview) respectively, allowing the pipe.

clusAssign.py

Assigns the clusters found in one dataset to the values of another. Note: this algorithm can assign some ICPs to cluster -1, which means that it failed to assign them to a cluster. Further note: this method requires both datasets to cover the same timespan.

  • -i PATH: The path for the file which contains the dataset.
  • -a PATH: The path for the aggregated dataset to compare this one to.
  • -c PATH: The path for the output cluster table.
  • -t NUM: Correlation threshold for clustering: if the best correlation is less than this number the ICP will be assigned to cluster -1. Default -1; can be any value greater than or equal to -1 and less than 1.

Example:

python downkwh.py -o ../data/test1kb.pkl -t public.icp_sample_18m
python clusAssign.py -i ../data/test1kb.pkl -c ../data/test1kbclustertable.pkl -a ../data/test1kagg.pkl -t 0.1
python agg.py -i ../data/test1kb.pkl -c ../data/test1kbclustertable.pkl -o ../data/test1kbagg.pkl -d

Downloads a new dataset from the public.icp_sample_18m sample and saves it to ../data/test1kb.pkl. Then assigns clusters to it from the ../data/test1kagg.pkl dataset with threshold 0.1, saving the cluster table to ../data/test1kbclustertable.pkl. Then aggregates this dataset, dropping the misc/‘-1’ cluster (the -d flag), and saves the result in ../data/test1kbagg.pkl.

pymodels.py

This script is a rewrite of the weathmod.R and combmodels.R R scripts described below. It fits a harmonic model to an aggregated dataset. Note: the model file created is quite large (on the order of 500MB); this could probably be pruned down.

  • -i PATH: Path of the file that contains the (aggregated) dataset to fit models to.
  • -w PATH: Path of the weather data involved in the model.
  • -m MODEL_FILE: Filename to save the model to, as pickle. Note: this is not the same kind of pickle that pickletocsv.py can read.
  • --weather-harmonics NUM: Number of harmonics (with base period of 1 year/365.25 days) to fit to the weather data; default is 2. Adding more harmonics leads to a more complicated model which may be more powerful but may also “overfit.”
  • --icp-harmonics NUM NUM NUM: (3 values) Number of harmonics of base period 1 year, 1 week, 1 day to fit, respectively. Default is 2, 3, and 3. Adding more harmonics leads to a more complicated model which may be more powerful but may also “overfit.”

Example:

python pymodels.py -i ../data/test1kagg.pkl -w ../data/weathertest.pkl -m ../models/testmod.pkl

Fits models to all clusters in ../data/test1kagg.pkl, using the weather data in ../data/weathertest.pkl, and saves them to ../models/testmod.pkl.

predict.py

Predicts unobserved demand values for a given cluster and time period, supplying either maximum/minimum temperatures (as in the shiny app) or a weather dataset.

  • -m MODEL_FILE: Filename to retrieve the model from, as pickle.
  • -w WEATHER_FILE: Path to weather data. This is optional, but if it is not specified the temperature parameter (-t) should be.
  • -o OUTPUT_FILE: File to output to. If - or absent prints to stdout.
  • -t TEMP TEMP: (2 values) If not supplying a weather file, can specify a minimum overnight and maximum daytime temperature value, similar to the shiny app.
  • -s START_DATE: The start date for the prediction interval, in YYYY-MM-DD format.
  • -e END_DATE: The end date for the prediction interval, in YYYY-MM-DD format.
  • -c CLUSTER: The cluster to be predicted.
  • --pkl: Output as a pickled dataframe rather than a csv file.

Examples:

python predict.py -m ../models/testmod.pkl -s 2018-01-01 -e 2018-02-01 -w ../data/weathertest.pkl -c 1 | tabview -

For cluster 1, using model ../models/testmod.pkl and weather data ../data/weathertest.pkl, predicts per-ICP demand for January 2018 and views it in the tabview viewer.

python predict.py -m ../models/testmod.pkl -t 5 10 -s 2019-07-01 -e 2019-07-02 -c 1 | tabview -

For cluster 1, using model ../models/testmod.pkl, a minimum overnight temperature of 5 degrees C, and a maximum temperature of 10 degrees C, predicts per-ICP demand for 1 July 2019.
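For reference, the individual examples above chain together into one end-to-end pipeline, run from inside py/ with the virtual environment active:

python downkwh.py -o ../data/test1k.pkl -n 24
python downweather.py -o ../data/weathertest.pkl
python dcorr.py -i ../data/test1k.pkl -o ../data/test1kdcorr.pkl
python clustering.py -i ../data/test1kdcorr.pkl -o ../data/test1kclustable.pkl -d -t ../img/test1kdendro.png
python agg.py -i ../data/test1k.pkl -c ../data/test1kclustable.pkl -o ../data/test1kagg.pkl
python pymodels.py -i ../data/test1kagg.pkl -w ../data/weathertest.pkl -m ../models/testmod.pkl
python predict.py -m ../models/testmod.pkl -s 2018-01-01 -e 2018-02-01 -w ../data/weathertest.pkl -c 1 | tabview -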

R

The scripts in R/ include visualisers for the data and scripts for creating some of the models.

Installing R

To install R on an EC2 instance you need to run the following:

sudo yum install R
sudo yum install R-devel
sudo yum install libcurl-devel
sudo ln -s /usr/lib64/libgfortran.so.3 /usr/lib64/libgfortran.so
sudo ln -s /usr/lib64/libquadmath.so.0 /usr/lib64/libquadmath.so

To install the packages needed for these scripts you should run the following from within R itself:

install.packages(c("dplyr", "tidyr", "ggplot2", "forecast", "TSA", "reticulate", "caTools", "scales", "argparse"))

clusterviz.R

This script visualises the patterns of each cluster in an aggregated file, saving the plots in a directory. Unfortunately, when loading directly from a pickle (with the --pkl option), this requires the Python module pandas to have been installed outside of a virtual environment, with pip install --user pandas. The script is called with Rscript clusterviz.R [options].

  • cluster_file: Source csv or pickle for visualisation.
  • -i IMG_PATH: Path for directory to store plots; default: ../img/.
  • -p POSTFIX: Postfix for image files, identifying this “batch” of clusters; default: _plot.
  • --width WIDTH: Width of plot output in cm.
  • --height HEIGHT: Height of plot output in cm.
  • --pkl: Load directly from a pickle instead of a csv file.

Example:

python py/pickletocsv.py data/test1kbagg.pkl data/test1kbagg.csv
Rscript R/clusterviz.R data/test1kbagg.csv -p "_1kb" --img img/

Creates a csv from the aggregated pickle produced in the clusAssign.py example above, then creates visualisation plots for all clusters, postfixed with _1kb (as in all_fre_1kb.png).

fftest.R

This is a now-redundant script testing fitting repeating patterns with Fourier transforms.

weathmod.R

This script takes the downloaded weather data (hard-coded path: ../data/2016-18-weather.pkl) and calculates running last-24-hour minimum and maximum temperatures. It then fits harmonic regressions using the first and second harmonics to de-trend the yearly pattern of each series, saves the resulting dataset at ../data/weatherharm.csv, and stores the models in ../models/weatherMinharmonic.rds and ../models/weatherMaxharmonic.rds. These paths can be changed in the source (and will likely need to be, as the data files are not included in the repository), but there are no command line options.

Commented out in the source are a number of potentially informative plots showing this process. One, for example, plots the detrended maximum daily temperature: black is the observed values, blue is the harmonic trend, and green is the residuals/de-trended data.

The justification for this is that we want the model to use the deviation from the typical value for that time of year, not the value itself.

As written, this script assumes a python virtual environment located in ../venv/.

combmodels.R

Taking ../data/9-clusters.agg.pkl and ../data/weatherharm.csv as inputs, this script first fits a harmonic model to cluster 1 and tests it with the assigned clusters in ../data/1617-agg.pkl. It then fits models to all clusters and saves them collectively in ../models/1kmods.rds.

As with the previous script, some plots are commented out in the source.

As written, this script assumes a python virtual environment located in ../venv/.
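Neither modelling script takes command line options; assuming the hard-coded paths have been adjusted to point at your data, they are presumably run the same way as clusterviz.R, with Rscript, from inside the R/ directory so that the relative ../data/, ../models/ and ../venv/ paths resolve:

cd R
Rscript weathmod.R
Rscript combmodels.R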

Shiny App

The R Shiny app is located in shiny/app.R. It requires the outputs from weathmod.R and combmodels.R, along with a couple of extra packages, installed with:

install.packages(c("shiny", "shinycssloaders"))

Explaining Shiny and how to properly install it on a server, in order to have a nice web app on a domain behind https etc., would take more time than I have (email me!), but for an example of a simple, self-hosted app in action see this project I made a while ago (source).

The app in this project can be run with Rscript app.R; this will (after several seconds of start-up) launch the app at a random port (e.g. 5576). Navigating to this location in your browser (firewall permitting) should display the app.
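For example, from the repository root (assuming the model outputs it needs are where app.R expects them):

cd shiny
Rscript app.R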

The “data and model” tab allows exploration of the data along with the model projection; plots can be saved by right-clicking. The “prediction” tab shows the projected demand pattern after the user specifies the number of ICPs in each of the nine clusters, the temperature range, and the month and day of the week. The month is approximate because the model uses a “typical” day near the middle of that month rather than a specific date.