# ampli

_Repository for Petra Lamborn's internship at [ampli](https://www.ampli.nz) Jan-Feb 2019_

This repository serves both as a reference for the folk at ampli to build upon what I produced for them during this internship, and as a CV item for me. People outside ampli will only be able to use a handful of the material below directly, but it still demonstrates what I can do.

The purpose of this project was to cluster the electricity demand time series of a large number of ICPs (Installation Control Points), and then project the patterns for these clusters into the future.

## What isn't in this repository

This repository does not contain the downloaded data (for confidentiality and size reasons), the fitted models (for similar reasons), or the configuration file for downloading from the database (which contains a password). Recreate empty `data/` and `models/` directories. The configuration file is at `py/database.ini` and looks like:

```ini
[postgresql]
host=
database=
user=
password=
```

This file is based on an example from [the PostgreSQL tutorial website](http://www.postgresqltutorial.com/postgresql-python/connect/). Fill in the `host`, `database`, `user` and `password` values with valid credentials.

## SQL

I've included a bunch of annotated SQL queries in `sql/queries.pgsql` and `sql/weather.pgsql`. The latter file is specifically about the weather dataset; the former is more general. Some of the queries are copied into `py/util.py` and similar files, so changing these two files will do nothing on its own. The `queries.pgsql` file includes the commands used to create the tables available in the `--table` option of the `downkwh.py` script.

## Python

`requirements.txt` contains the python packages required to set up a virtual environment with `virtualenv -p /usr/bin/python3 venv` and `pip install -r requirements.txt`. Notably these are:

* numpy
* pandas
* pkg-resources
* psycopg2-binary
* pyparsing
* python-dateutil
* pytz
* scipy
* seaborn
* statsmodels
* tabview

Virtual environments are loaded with `source venv/bin/activate`. The python scripts are in the `py/` folder. The scripts that are designed to be called directly are run with `python <scriptname>`; use `python <scriptname> -h` to view the help for each. Note that most options have a default, which may not be what you want, so always check.

### `util.py`

This script is imported by several of the other scripts, particularly for downloading the data from the database.

### `downkwh.py`

Downloads demand data from the database. Options:

* `-o PATH`: The path for the python "pickle" file to store the result in.
* `-s DATE`: The start date for the download in `YYYY-MM-DD` format; default of 2017-01-01.
* `-e DATE`: The end date in `YYYY-MM-DD` format; default of 2018-01-01.
* `-t TABLE`: The table in the database from which to obtain the wanted ICP ids; default is `public.icp_sample`, a table which contains 1000 ICPs with good data for 2017.
**Important**: Don't assume that SQL injection can't come through this vector, although I have constrained the values that this script will accept from the command line to the following list:

* `public.best_icp`: all ICPs with at least 360 days of data in 2017
* `public.best_icp_1618`: all ICPs with at least 720 days of data in the 2 years from 1 April 2016
* `public.best_icp_18m`: all ICPs with at least 540 days of data from July 2016 to the end of 2017
* `public.icp_sample`: a pre-generated 1k sample from best_icp
* `public.icp_sample_5k`: a pre-generated 5k sample from best_icp
* `public.icp_sample_1618`: a pre-generated 1k sample from best_icp_1618
* `public.icp_sample_18m`: a pre-generated 1k sample from best_icp_18m

The remaining options:

* `-n NUM`: The algorithm downloads the dataset in pieces, optimises them to reduce storage space, and reassembles them. This option defines the number of such pieces; it should always be less than the number of days between the start and end dates. Default of 12.
* `--no-pivot`: This option can probably be ignored; it downloads the dataset in a less efficient non-"pivoted" form, which was used in the original versions of some of these scripts.
* `-v`: Output some extra progress information as it goes; mostly useful for debugging.

Example:

```bash
python downkwh.py -o ../data/test1k.pkl -n 24
```

Downloads data from the default period into `../data/test1k.pkl` using 24 segments.

### `downweather.py`

Downloads weather (temperature and humidity) data from the database, from one specified station.

* `-o PATH`: The path for the python "pickle" file to store the result in.
* `-s DATE`: The start date for the download in `YYYY-MM-DD` format; default of 2016-04-01.
* `-e DATE`: The end date in `YYYY-MM-DD` format; default of 2019-01-01.
* `--station`: The station to download from; default is 2006, which is located near Pukekohe.
* `-v`: Output some extra progress information as it goes; mostly useful for debugging.

Example:

```bash
python downweather.py -o ../data/weathertest.pkl
```

Downloads data from the default period into `../data/weathertest.pkl`.

### `dcorr.py`

Creates a distance correlation matrix from a pickled pandas dataframe. Note that this takes some time, but it has a progress bar.

* `-i PATH`: The path for the python "pickle" file to retrieve the original data from.
* `-o PATH`: The path for the python "pickle" file to store the result in.
* `-p`: If the dataframe is not in wide form, pivots it so that it is first. Note: untested and unused, representing past behaviour.

Example:

```bash
python dcorr.py -i ../data/test1k.pkl -o ../data/test1kdcorr.pkl
```

Calculates the correlation matrix from the dataset previously downloaded to `../data/test1k.pkl` and stores it in `../data/test1kdcorr.pkl`.

### `clustering.py`

Algorithm for hierarchical clustering. **Note**: different datasets and parameters may produce similar clusters in a different order.

* `-i PATH`: The path for the python "pickle" file to retrieve the original data from.
* `-o PATH`: The path for the python "pickle" file to store the result in.
* `--method METHOD`: Clustering method for hierarchical clustering; default is 'ward'. See [the scipy docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html) for more information and other options.
* `--clusters NUM`: The number of clusters; default is 9.
* `-d`: Include a dendrogram; requires `-t/--tree`.
* `-t PATH`: Path to save the dendrogram image to if `-d` is given.
* `-v`: Output some extra progress information as it goes; mostly useful for debugging.
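For orientation, the core of this step comes down to a few `scipy` calls. The following is only a minimal sketch, not the script itself: it assumes the input is the square distance-correlation matrix produced by `dcorr.py` and that `1 - correlation` is used as the distance (an assumption; check `clustering.py` for the exact details).

```python
# Minimal sketch of hierarchical clustering from a distance-correlation matrix.
# Assumes the matrix is square (ICP x ICP) and that 1 - correlation is used as
# the distance; the real clustering.py may differ in details.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

dcorr = pd.read_pickle("../data/test1kdcorr.pkl")        # ICP x ICP matrix
dist = squareform(1 - dcorr.values, checks=False)        # condensed distance form
links = linkage(dist, method="ward")                     # cf. the --method option
clusters = pd.Series(fcluster(links, t=9, criterion="maxclust"),
                     index=dcorr.index, name="cluster")  # cf. the --clusters option
```

The real script additionally saves the resulting cluster table as a pickle and can draw the dendrogram (`-d`/`-t`).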
Example:

```bash
python clustering.py -i ../data/test1kdcorr.pkl -o ../data/test1kclustable.pkl -d -t ../img/test1kdendro.png
```

Clusters data from `../data/test1kdcorr.pkl`; saves the table of clusters/ICPs at `../data/test1kclustable.pkl` and the dendrogram at `../img/test1kdendro.png`.

### `agg.py`

Aggregates data based on clusters. **Note**: the columns `CI_low` and `CI_high` do not represent formal statistical confidence intervals, merely lower and upper quartiles; they are for indicative purposes only.

* `-i PATH`: The path for the python "pickle" file to retrieve the original data from (i.e. the data downloaded by `downkwh.py`).
* `-c PATH`: The path for the python "pickle" file to retrieve the cluster data from.
* `-o PATH`: The path for the python "pickle" file to store the result in.
* `-d`: Drop cluster '-1', which represents the miscellaneous/unclustered pseudo-cluster produced by `clusAssign.py`.
* `-p`: If the dataframe is not in wide form, pivots it so that it is first. Note: untested and unused, representing past behaviour.

Example:

```bash
python agg.py -i ../data/test1k.pkl -c ../data/test1kclustable.pkl -o ../data/test1kagg.pkl
```

Aggregates data from `../data/test1k.pkl` by the cluster information in `../data/test1kclustable.pkl` and saves the result in `../data/test1kagg.pkl`.

### `pickletocsv.py`

Helper script to transform a pickle into a csv file, for easier importing into e.g. Excel.

* `input`: The path for the python "pickle" file which contains the dataset.
* `output`: The path for the csv file to store the dataset in; if omitted, or `-`, prints to `stdout`.
* `-r`: Include row names/index labels in the csv. This may be essential for proper exporting of some datasets.
* `-v`: Output extra information, including the dimensions of the dataset.

Examples:

```bash
python pickletocsv.py ../data/test1kagg.pkl | less
```

Reads the file at `../data/test1kagg.pkl` and views it in the UNIX pager `less`.

```bash
python pickletocsv.py ../data/test1kagg.pkl - | tabview -
```

Reads the same file and views it using the `tabview` python module (included in `requirements.txt`). The two `-` arguments are shorthand for `stdout` and `stdin` respectively, allowing the pipe.

### `clusAssign.py`

Assigns the clusters found in one dataset to the values of another. **Note**: this algorithm can assign some ICPs to cluster -1, which means that it failed to assign them to a cluster. **Further note**: this method requires both datasets to cover the same timespan.

* `-i PATH`: The path for the file which contains the dataset.
* `-a PATH`: The path for the aggregated dataset to compare this one to.
* `-c PATH`: Path for the output.
* `-t NUM`: Correlation threshold for clustering: if the best correlation is less than this number the ICP will be assigned to cluster -1. Default is -1; it can be any value greater than or equal to -1 and less than 1.

Example:

```bash
python downkwh.py -o ../data/test1kb.pkl -t public.icp_sample_18m
python clusAssign.py -i ../data/test1kb.pkl -c ../data/test1kbclustertable.pkl -a ../data/test1kagg.pkl -t 0.1
python agg.py -i ../data/test1kb.pkl -c ../data/test1kbclustertable.pkl -o ../data/test1kbagg.pkl -d
```

Downloads a new dataset from the `public.icp_sample_18m` sample and saves it to `../data/test1kb.pkl`. Then assigns clusters to it from the `../data/test1kagg.pkl` dataset with threshold 0.1, saving the cluster table to `../data/test1kbclustertable.pkl`. Then aggregates this dataset (dropping the misc/'-1' cluster) and saves the result in `../data/test1kbagg.pkl`.

### `pymodels.py`

This script is a rewrite of the `weathmod.R` and `combmodels.R` R scripts described below.
It fits a harmonic model to an aggregated dataset. **Note**: the model file created is quite large (on the order of 500MB). This could probably be pruned down.

* `-i PATH`: Path of the file that contains the (aggregated) dataset to fit models to.
* `-w PATH`: Path of the weather data involved in the model.
* `-m MODEL_FILE`: Filename to save the model to, as a pickle. Note: this is not the same kind of pickle that `pickletocsv.py` can read.
* `--weather-harmonics NUM`: Number of harmonics (with a base period of 1 year/365.25 days) to fit to the weather data; default is 2. Adding more harmonics leads to a more complicated model which may be more powerful but may also "overfit".
* `--icp-harmonics NUM NUM NUM`: (3 values) Number of harmonics with base periods of 1 year, 1 week and 1 day to fit, respectively. Defaults are 2, 3, and 3. Adding more harmonics leads to a more complicated model which may be more powerful but may also "overfit".

Example:

```bash
python pymodels.py -i ../data/test1kagg.pkl -w ../data/weathertest.pkl -m ../models/testmod.pkl
```

Fits all clusters in `../data/test1kagg.pkl` with the weather data in `../data/weathertest.pkl` and saves the model to `../models/testmod.pkl`.

### `predict.py`

Predicts unobserved demand values for a given cluster and time period, supplying either minimum/maximum temperatures (as in the shiny app) or a weather dataset.

* `-m MODEL_FILE`: Filename to retrieve the model from, as a pickle.
* `-w WEATHER_FILE`: Path to weather data. This is optional, but if it is not specified the temperature parameter should be.
* `-o OUTPUT_FILE`: File to output to. If `-` or absent, prints to `stdout`.
* `-t TEMP TEMP`: (2 values) If not supplying a weather file, specifies a minimum overnight and a maximum daytime temperature value, similar to the shiny app.
* `-s START_DATE`: The start date for the prediction interval, in `YYYY-MM-DD` format.
* `-e END_DATE`: The end date for the prediction interval, in `YYYY-MM-DD` format.
* `-c CLUSTER`: The cluster to be predicted.
* `--pkl`: Output as a pickled dataframe rather than a csv file.

Examples:

```bash
python predict.py -m ../models/testmod.pkl -s 2018-01-01 -e 2018-02-01 -w ../data/weathertest.pkl -c 1 | tabview -
```

For cluster 1, model `../models/testmod.pkl` and weather data `../data/weathertest.pkl`, predicts per-ICP demand for the month of January 2018 and views it in the `tabview` viewer.

```bash
python predict.py -m ../models/testmod.pkl -t 5 10 -s 2019-07-01 -e 2019-07-02 -c 1 | tabview -
```

For cluster 1, model `../models/testmod.pkl`, a minimum overnight temperature of 5 degrees C and a maximum temperature of 10 degrees C, predicts per-ICP demand for the first of July 2019.

## R

The scripts in `R/` include visualisers for the data and scripts for the creation of some models.

### Installing R

To install R on an EC2 instance you need to run the following:

```bash
sudo yum install R
sudo yum install R-devel
sudo yum install libcurl-devel
sudo ln -s /usr/lib64/libgfortran.so.3 /usr/lib64/libgfortran.so
sudo ln -s /usr/lib64/libquadmath.so.0 /usr/lib64/libquadmath.so
```

To install the packages needed for these scripts, run the following from within `R` itself:

```R
install.packages(c("dplyr", "tidyr", "ggplot2", "forecast", "TSA", "reticulate", "caTools", "scales", "argparse"))
```

### `clusterviz.R`

This script visualises the patterns of each cluster in an aggregated file, saving plots to a directory.
Unfortunately, when loading directly from a pickle (with the `--pkl` option) this requires the python module `pandas` to have been installed outside of a virtual environment, with `pip install --user pandas`. The script is called with `Rscript clusterviz.R [options]`.

* `cluster_file`: Source csv or pickle for visualisation.
* `-i IMG_PATH`: Path for the directory to store plots in; default: `../img/`.
* `-p POSTFIX`: Postfix for image files, identifying this "batch" of clusters; default: `_plot`.
* `--width WIDTH`: Width of plot output in cm.
* `--height HEIGHT`: Height of plot output in cm.
* `--pkl`: Load directly from a pickle instead of a csv file.

Example:

```bash
python py/pickletocsv.py data/test1kbagg.pkl data/test1kbagg.csv
Rscript R/clusterviz.R data/test1kbagg.csv -p "_1kb" --img img/
```

Creates a csv for the pickle created by `clusAssign.py`, then creates visualisation plots for all clusters and postfixes them with `_1kb` (as in `all_fre_1kb.png`).

### `fftest.R`

This is a now-redundant script that tested fitting repeating patterns with Fourier transforms.

### `weathmod.R`

This script takes the downloaded weather data (hard-coded path: `../data/2016-18-weather.pkl`) and calculates running last-24-hour minimum and maximum temperatures. It then fits harmonic regressions, using the first and second harmonics, to de-trend the yearly pattern of each series, saving the resulting dataset at `../data/weatherharm.csv` and storing the models in `../models/weatherMinharmonic.rds` and `../models/weatherMaxharmonic.rds`. These paths can be changed in the source (and will likely need to be, as the data files are not included in the repository), but there are no command line options. Commented out in the source are a number of potentially informative plots showing this process. For example:

![Detrended maximum daily temperature](img/tempdetrend-max.png)

Black is the observed values; blue is the harmonic trend; green is the residuals/de-trended data. The justification for this is that we want to include in our model the deviation from the typical value for that time of year, not the value itself. As written, this script assumes a python virtual environment located in `../venv/`.

### `combmodels.R`

Taking `../data/9-clusters.agg.pkl` and `../data/weatherharm.csv` as inputs, this script first fits a harmonic model to cluster 1 and tests it against the assigned clusters in `../data/1617-agg.pkl`. It then fits models to all clusters and saves them collectively in `../models/1kmods.rds`. As in the previous script, some plots are commented out. As written, this script assumes a python virtual environment located in `../venv/`.

## Shiny App

The [R Shiny](https://shiny.rstudio.com/) app is located in `shiny/app.R`. It requires the outputs from `weathmod.R` and `combmodels.R`, along with a couple of extra packages, installed with:

```R
install.packages(c("shiny", "shinycssloaders"))
```

Explaining Shiny and how to properly install it on a server in order to have a nice web app on a domain behind https _etc._ would take more time than I have (email me!), but for an example of a simple, self-hosted app in action see [this project](https://shiny.petras.space/Emissions/) I made a while ago ([source](https://git.petras.space/petra/Emissions)). The app in this project can be run with `Rscript app.R`; this will (after several seconds of start-up) launch the app on a random port (e.g. 5576). Navigating to this location in your browser (firewall permitting) should display the app.
The "data and model" tab allows exploration of the data along with the model projection; plots can be saved by right-clicking. The "prediction" tab allows, after specification of the number of icps in each of nine clusters, the temperature range, and the month and day of the week, the projected demand pattern. The month is approximate because it models a "typical" day in the approximate middle of that month, without a specific date.