ampli

Repository for Petra’s work at ampli Jan-Feb 2019

What isn’t in this repository

This repository does not contain the downloaded data (for confidentiality and size reasons), the fitted models (for similar reasons), or the configuration file for downloading from the database (which contains a password). Recreate empty data/ and models/ directories. The configuration file is at py/database.ini and looks like:

[postgresql]
host=<Hostname>
database=<Database>
user=<Username>
password=<Password>

This file is based on an example from the PostgreSQL website. Replace <Hostname>, <Database>, <Username> and <Password> with valid credentials.
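
For reference, a file in this format can be read with Python’s configparser and passed straight to psycopg2. A minimal sketch, assuming the [postgresql] section above (the helper name is illustrative, not necessarily what util.py defines):

import configparser
import psycopg2

def connect(path='py/database.ini'):
    # read the [postgresql] section and use its keys as connection keywords
    parser = configparser.ConfigParser()
    parser.read(path)
    return psycopg2.connect(**dict(parser['postgresql']))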

SQL

I’ve included a number of annotated SQL queries in sql/queries.pgsql and sql/weather.pgsql. The latter file is specifically about the weather dataset; the former is more general. Some of the queries are copied into py/util.py, so editing these two files does nothing on its own.

Python

requirements.txt lists the Python packages required. Set up a virtual environment with virtualenv -p /usr/bin/python3 venv and install the packages with pip install -r requirements.txt. Notably these are:

  • numpy
  • pandas
  • pkg-resources
  • psycopg2-binary
  • pyparsing
  • python-dateutil
  • pytz
  • scipy
  • seaborn
  • statsmodels

The virtual environment is activated with source venv/bin/activate. The Python scripts are in the py/ folder. Scripts designed to be called directly are run with python <scriptname.py>; use python <scriptname.py> -h to view their help. Note that most options have a default, which may not be what you want, so always check.

util.py

This script is imported by several other scripts, particularly for downloading the data from the database.

downkwh.py

Downloads demand data from the database. Options:

  • -o PATH: The path for the python “pickle” file to store the result in.
  • -s DATE: The start date for the download in YYYY-MM-DD format; default is 2017-01-01.
  • -e DATE: The end date in YYYY-MM-DD format; default is 2018-01-01.
  • -t TABLE: The table in the database from which to obtain the wanted ICP ids; default is public.icp_sample, a table which contains 1000 ICPs with good data for 2017. Important: Don’t assume that SQL injection can’t come through this vector, although I have constrained the values that this script will accept from the command line to the following list:
    • public.best_icp: all ICPs with at least 360 days of data in 2017
    • public.best_icp_1618: all ICPs with at least 720 days of data in the 2 years from 1 April 2016
    • public.best_icp_18m: all ICPs with at least 540 days of data from July 2016 to the end of 2017
    • public.icp_sample: a pre-generated 1k sample from best_icp
    • public.icp_sample_5k: a pre-generated 5k sample from best_icp
    • public.icp_sample_1618: a pre-generated 1k sample from best_icp_1618
    • public.icp_sample_18m: a pre-generated 1k sample from best_icp_18m
  • -n NUM: The algorithm downloads the dataset in pieces, optimises them to reduce storage space, and reassembles them. This option defines the number of pieces; it should always be less than the number of days between the start and end dates. Default is 12.
  • --no-pivot: This option can probably be ignored, as it downloads the dataset in a less efficient non-“pivoted” form, which was used in the original versions of some of these scripts.
  • -v: Output some extra progress information as it goes; mostly useful for debugging.

Example:

python downkwh.py -o ../data/test1k.pkl -n 24

Downloads data from the default period into ../data/test1k.pkl, using 24 segments.
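
The result is an ordinary pandas dataframe, so it can be checked from a Python session (a quick sanity check, not part of the pipeline):

import pandas as pd

kwh = pd.read_pickle('../data/test1k.pkl')
print(kwh.shape)   # in the default pivoted form, columns should correspond to ICPs
print(kwh.head())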

downweather.py

Downloads weather (temperature and humidity) data from the database, from one specified station.

  • -o PATH: The path for the python “pickle” file to store the result in.
  • -s DATE: The start date for the download in YYYY-MM-DD format; default is 2016-04-01.
  • -e DATE: The end date in YYYY-MM-DD format; default is 2019-01-01.
  • --station: The station to download from; default is 2006, which is located near Pukekohe.
  • -v: Output some extra progress information as it goes; mostly useful for debugging.

Example:

python downweather.py -o ../data/weathertest.pkl

Downloads data from the default period into ../data/weathertest.pkl.

dcorr.py

Creates a distance correlation matrix from a pickled pandas dataframe. Note that this takes some time, but displays a progress bar.

  • -i PATH: The path for the python “pickle” file to retrieve the original data from.
  • -o PATH: The path for the python “pickle” file to store the result in.
  • -p: If the dataframe is not in wide form, pivots it into wide form first. Note: untested and unused; represents past behaviour.

Example:

python dcorr.py -i ../data/test1k.pkl -o ../data/test1kdcorr.pkl

Calculates the distance correlation matrix from the dataset previously downloaded to ../data/test1k.pkl and stores it in ../data/test1kdcorr.pkl.
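
For intuition, the distance correlation of two series can be computed with plain numpy. This is a sketch of the statistic itself, not necessarily the exact implementation in dcorr.py:

import numpy as np

def dist_corr(x, y):
    # pairwise absolute-difference (distance) matrices
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # double-centre each matrix
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    # distance covariance over the geometric mean of distance variances
    return np.sqrt((A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean()))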

clustering.py

Algorithm for hierarchical clustering. Note: different datasets and parameters may produce similar clusters in a different order.

  • -i PATH: The path for the python “pickle” file to retrieve the original data from.
  • -o PATH: The path for the python “pickle” file to store the result in.
  • --method METHOD: Clustering method for hierarchical clustering; default is ‘ward’. See the scipy docs for more information and other options.
  • --clusters NUM: The number of clusters; default is 9.
  • -d: Include a dendrogram; requires -t/--tree.
  • -t PATH: Path to save the dendrogram image to, if -d is given.
  • -v: Output some extra progress information as it goes; mostly useful for debugging.

Example:

python clustering.py -i ../data/test1kdcorr.pkl -o ../data/test1kclustable.pkl -d -t ../img/test1kdendro.png

Clusters data from ../data/test1kdcorr.pkl; saves the table of clusters/ICPs at ../data/test1kclustable.pkl and the dendrogram at ../img/test1kdendro.png.
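
Under the hood this is scipy’s hierarchical clustering. A plausible equivalent, assuming the input pickle holds a square matrix of distance correlations and treating 1 - correlation as a distance:

import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

dc = pd.read_pickle('../data/test1kdcorr.pkl')
d = squareform(1 - dc.values, checks=False)          # condensed distance vector
Z = linkage(d, method='ward')                        # the default --method
clusters = fcluster(Z, t=9, criterion='maxclust')    # the default --clusters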

agg.py

Aggregates data based on clusters. Note: the columns CI_low and CI_high do not represent formal statistical confidence intervals, merely lower and upper quartiles; they are for indicative purposes only.

  • -i PATH: The path for the python “pickle” file to retrieve the original data from (i.e. the data downloaded by downkwh.py).
  • -c PATH: The path for the python “pickle” file to retrieve the cluster data from.
  • -o PATH: The path for the python “pickle” file to store the result in.
  • -d: Drop cluster ‘-1’, which represents the miscellaneous/unclustered pseudo-cluster produced by clusAssign.py.
  • -p: If the dataframe is not in wide form, pivots it into wide form first. Note: untested and unused; represents past behaviour.

Example:

python agg.py -i ../data/test1k.pkl -c ../data/test1kclustable.pkl -o ../data/test1kagg.pkl

Aggregates the data from ../data/test1k.pkl by the cluster information in ../data/test1kclustable.pkl and saves the result in ../data/test1kagg.pkl.
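
The aggregation itself is ordinary pandas group-by machinery. A rough sketch, assuming wide demand data and a cluster table with a ‘cluster’ column (that column name is a guess):

import pandas as pd

kwh = pd.read_pickle('../data/test1k.pkl')                 # wide form: one column per ICP
clustable = pd.read_pickle('../data/test1kclustable.pkl')
# group the ICP columns by their assigned cluster, then take
# the quartiles across each cluster at every time step
agg = kwh.T.groupby(clustable['cluster']).quantile([0.25, 0.5, 0.75])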

pickletocsv.py

Helper script to transform a pickle into a csv file, for easier importing into e.g. Excel.

  • input: The path for the python “pickle” file which contains the dataset.
  • output: The path for the csv file to store the dataset in; if omitted, or -, prints to stdout.
  • -r: Include row names/index labels in the csv. This may be essential for proper exporting of some datasets.
  • -v: Output extra information, including dimensions of dataset.

Examples:

python pickletocsv.py ../data/test1kagg.pkl | less

Reads the file at ../data/test1kagg.pkl and views it in the UNIX pager less.

python pickletocsv.py ../data/test1kagg.pkl - | tabview -

Reads the same file and views it using the tabview python module (included in requirements.txt). The two - arguments are shorthand for stdout and stdin respectively, allowing the pipe.
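
The script is essentially a thin wrapper over pandas; a rough one-line equivalent (with -r corresponding to index=True) is:

import pandas as pd

pd.read_pickle('../data/test1kagg.pkl').to_csv('../data/test1kagg.csv', index=True)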

clusAssign.py

Assigns clusters found from one dataset to the values of another. Note: this algorithm can assign some ICPs to cluster -1, which means that it failed to assign them to a cluster. Further note: this method requires both datasets to cover the same timespan.

  • -i PATH: The path for the file which contains the dataset.
  • -a PATH: The path for the aggregated dataset to compare this one to.
  • -c PATH: The path for the output cluster table.
  • -t NUM: Correlation threshold for clustering: if the best correlation is less than this number, the ICP is assigned to cluster -1. Default is -1; any value greater than or equal to -1 and less than 1 is accepted.

Example:

python downkwh.py -o ../data/test1kb.pkl -t public.icp_sample_18m
python clusAssign.py -i ../data/test1kb.pkl -c ../data/test1kbclustertable.pkl -a ../data/test1kagg.pkl -t 0.1
python agg.py -i ../data/test1kb.pkl -c ../data/test1kbclustertable.pkl -o ../data/test1kbagg.pkl -d

Downloads a new dataset from the public.icp_sample_18m sample and saves it to ../data/test1kb.pkl. Then assigns clusters to it from the ../data/test1kagg.pkl dataset with threshold 0.1, saving the cluster table to ../data/test1kbclustertable.pkl. Then aggregates this dataset (dropping the misc/‘-1’ cluster via -d) and saves the result in ../data/test1kbagg.pkl.
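
Conceptually the assignment is a nearest-profile match. A sketch of the rule, assuming the aggregate file can be read as one profile column per cluster (that layout is an assumption):

import pandas as pd

kwh = pd.read_pickle('../data/test1kb.pkl')         # wide demand data, one column per ICP
profiles = pd.read_pickle('../data/test1kagg.pkl')  # assumed: one profile column per cluster

def assign_cluster(icp_series, profiles, threshold=-1.0):
    # correlate this ICP with every cluster profile over the shared timespan
    corrs = profiles.corrwith(icp_series)
    # below the threshold, fall back to the miscellaneous pseudo-cluster -1
    return corrs.idxmax() if corrs.max() >= threshold else -1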

R

The scripts in R/ include visualisers for the data, and for the creation of some models.

Installing R

To install R on an EC2 instance you need to run the following:

sudo yum install R
sudo yum install R-devel
sudo yum install libcurl-devel
sudo ln -s /usr/lib64/libgfortran.so.3 /usr/lib64/libgfortran.so
sudo ln -s /usr/lib64/libquadmath.so.0 /usr/lib64/libquadmath.so

To install the packages needed for these scripts you should run the following from within R itself:

install.packages(c("dplyr", "tidyr", "ggplot2", "forecast", "TSA", "reticulate", "caTools", "scales", "argparse"))

clusterviz.R

This script visualises the patterns of each cluster in an aggregated file, saving plots in a directory. Unfortunately, when loading directly from a pickle (with the --pkl option), this requires the Python module pandas to have been installed outside of a virtual environment, with pip install --user pandas. The script is called with Rscript clusterviz.R [options].

  • cluster_file: Source csv or pickle for visualisation.
  • -i IMG_PATH: Path for directory to store plots; default: ../img/.
  • -p POSTFIX: Postfix for image files, identifying this “batch” of clusters; default: _plot.
  • --width WIDTH: Width of plot output in cm.
  • --height HEIGHT: Height of plot output in cm.
  • --pkl: Load directly from a pickle instead of a csv file.

Example:

python py/pickletocsv.py data/test1kbagg.pkl data/test1kbagg.csv
Rscript R/clusterviz.R data/test1kbagg.csv -p "_1kb" --img img/

Creates a csv from the aggregated pickle produced in the clusAssign.py example above, then creates visualisation plots for all clusters, postfixing the image files with _1kb (as in all_fre_1kb.png).

fftest.R

This is a now-redundant script testing fitting repeating patterns with Fourier transforms.

weathmod.R

This script takes the downloaded weather data (hard-coded path: ../data/2016-18-weather.pkl) and calculates running last-24-hour minimum and maximum temperatures. It then fits harmonic regressions using the first and second harmonics to de-trend the yearly pattern of each series, saving the resulting dataset at ../data/weatherharm.csv and the models in ../models/weatherMinharmonic.rds and ../models/weatherMaxharmonic.rds. These paths can be changed in the source, but there are no command line options.
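
For intuition, a first-and-second-harmonic regression of this kind can be sketched in Python with statsmodels (the R script is the actual implementation; the daily spacing and the ‘tmax’ column name below are assumptions):

import numpy as np
import pandas as pd
import statsmodels.api as sm

weather = pd.read_pickle('../data/2016-18-weather.pkl')
t = np.arange(len(weather)) / 365.25                 # assumes one observation per day
X = sm.add_constant(np.column_stack([
    np.sin(2 * np.pi * t), np.cos(2 * np.pi * t),    # first (yearly) harmonic
    np.sin(4 * np.pi * t), np.cos(4 * np.pi * t),    # second harmonic
]))
fit = sm.OLS(weather['tmax'], X).fit()               # 'tmax' is a hypothetical column
detrended = fit.resid                                # residuals = the de-trended series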

Commented out in the source are a number of potentially informative plots showing this process. For example:

[Plot: detrended maximum daily temperature]

Black is observed values; blue is harmonic trend; green is residuals/de-trended data.

The justification for this is that we want to include the deviation from the typical value for that time of year in our model, not the value itself.

combmodels.R

This script takes the