Using Pandas

Pandas is a Python library that is used for data analysis and manipulation.

In SWITCH, Pandas is mainly used to create graphs and output files after solving.

This document gives a brief overview of key concepts and commands to get started with Pandas. There are many better resources for learning Pandas available online, including entire courses.

Most importantly, the Pandas documentation and API reference should be your go-to when trying to learn something new about Pandas.

Key Concepts

DataFrame

The DataFrame is the main Pandas data structure and stores tabular data. Dataframes have rows, columns and labelled axes (e.g. row or column names). When manipulating data, the common practice is to store your main dataframe in a variable called df.
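To make this concrete, here is a minimal sketch of building a dataframe from a dictionary of columns. The column names ("country", "value") and the data are made up purely for illustration.

import pandas as pd

# Build a small dataframe from a dictionary: keys become column names,
# lists become the column values. The data here is made up.
df = pd.DataFrame({
    "country": ["AU", "NZ", "AU"],
    "value": [1.0, 2.5, 3.0],
})
print(df)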

Series

A series can be thought of as a single column in a dataframe. It's a 1-dimensional array of values.
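As a quick illustration, selecting a single column of the dataframe built in the sketch above returns a Series (df and its "value" column are the illustrative names from that sketch).

# Selecting one column of a dataframe gives a Series.
values = df["value"]
print(type(values))  # <class 'pandas.core.series.Series'>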

Indexes

Pandas has two ways of working with dataframes: with or without custom indexes. Custom indexes are essentially labels for each row. For example, the following dataframe has 4 columns (A, B, C, D) and a custom index (the date).

                   A         B         C         D
2000-01-01  0.815944 -2.093889  0.677462 -0.982934
2000-01-02 -1.688796 -0.771125 -0.119608 -0.308316
2000-01-03 -0.527520  0.314343  0.852414 -1.348821
2000-01-04  0.133422  3.016478 -0.443788 -1.514029
2000-01-05 -1.451578  0.455796  0.559009 -0.247087

The same dataframe can be expressed without the custom index as follows. Here the date is a column just like the others and the index is the default index (just the row number).

        date         A         B         C         D
0 2000-01-01  0.815944 -2.093889  0.677462 -0.982934
1 2000-01-02 -1.688796 -0.771125 -0.119608 -0.308316
2 2000-01-03 -0.527520  0.314343  0.852414 -1.348821
3 2000-01-04  0.133422  3.016478 -0.443788 -1.514029
4 2000-01-05 -1.451578  0.455796  0.559009 -0.247087

Using custom indexes is quite powerful but more advanced. When starting out it's best to avoid custom indexes.
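If you do need to switch between the two representations shown above, set_index and reset_index convert in either direction. The sketch below uses made-up data; only the structure matters.

import pandas as pd

# Made-up data with a "date" column, mirroring the tables above.
df = pd.DataFrame({
    "date": pd.date_range("2000-01-01", periods=3),
    "A": [0.1, -0.2, 0.3],
})

df_indexed = df.set_index("date")   # "date" becomes a custom index, as in the first table
df_flat = df_indexed.reset_index()  # back to the default integer index, as in the second table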

Chaining

Commands applied to a dataframe return a new object; they do not modify the dataframe they're called on.

For example, the following has no effect.

df.groupby("country")

Instead, you should always update your variable with the returned result. For example,

df = df.groupby("country")

This allows you to "chain" multiple operations together. E.g.

df = df.groupby("country").rename(...).some_other_command(...)
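As a concrete, runnable version of this idea, the sketch below groups some made-up data by country, sums it, and renames the result column in a single chain; the data and column names are illustrative only.

import pandas as pd

# Made-up data purely to illustrate chaining.
df = pd.DataFrame({
    "country": ["AU", "AU", "NZ"],
    "value": [1.0, 2.0, 3.0],
})

# Group by country, sum the values, then rename the summed column, all in one chain.
df = (
    df.groupby("country", as_index=False)
    .sum()
    .rename(columns={"value": "total_value"})
)
print(df)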

Useful commands

  • df = pandas.read_csv(filepath, index_col=False). This command reads a csv file from filepath and returns a dataframe that gets stored in df. index_col=False ensures that no custom index is automatically created.

  • df.to_csv(filepath, index=False). This command writes a dataframe to filepath. index=False means that the index is not written to the file. Use this when you're not using custom indexes, since you probably don't want the default index (just the row numbers) written to your CSV.

  • df["column_name"]: Returns a Series containing the values for that column.

  • df[["column_1", "column_2"]]: Returns a DataFrame containing only the specified columns.

  • df[df["column_name"] == "some_value"]: Returns a dataframe containing only the rows where the condition in the square brackets is met. In this case, we keep only the rows where the value under column_name equals "some_value".

  • df.merge(other_df, on=["key_1", "key_2"]): Merges df with other_df, using key_1 and key_2 as the join keys (a combined example follows this list).

  • df.info(): Prints the columns in the dataframe and some info about each column.

  • df.head(): Prints the first few rows in the dataframe.

  • df.drop_duplicates(): Drops duplicate rows from the dataframe.

  • Series.unique(): Returns the distinct values in the series (as a NumPy array), with duplicates dropped.
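The short sketch below combines a few of the commands above (merge, filtering by a condition, and unique) on two made-up tables; all the names are illustrative.

import pandas as pd

# Two made-up tables sharing an "id" column.
left = pd.DataFrame({"id": ["a", "b", "c"], "country": ["AU", "NZ", "AU"]})
right = pd.DataFrame({"id": ["a", "b", "c"], "value": [1.0, 2.0, 3.0]})

merged = left.merge(right, on=["id"])        # join the two tables on "id"
au_only = merged[merged["country"] == "AU"]  # keep only the rows where country is "AU"
countries = merged["country"].unique()       # distinct country codes (a NumPy array)

print(merged.head())
print(au_only)
print(countries)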

Note on reading SWITCH files

When reading SWITCH csv files, it is recommended to use the following arguments in pd.read_csv().

  • index_col=False. This stops Pandas from automatically using the first column as an index, ensuring you are not using custom indexes (see the notes on custom indexes above).

  • dtype={"GENERATION_PROJECT": str}: If all the generation project IDs happen to be numbers, Pandas will automatically set the GENERATION_PROJECT column type to int. We don't want this, since it can cause issues when dealing with multiple dataframes, some of which have non-numeric IDs. (E.g. if you try merging a dataframe where GENERATION_PROJECT is an int with another where it's a str, the merge will fail or produce no matches.)

  • dtype=str: An even safer option than dtype={"GENERATION_PROJECT": str} is dtype=str. This is particularly important when reading a file that will then be re-output with minimal changes. Without this option, there's a risk of floating point values being slightly modified (see here) or of integer columns containing NA values (.) being "promoted" to floats. Note that with dtype=str all columns are strings, so to do mathematical computation on a column it will first need to be converted with .astype() (see the conversion example below).

  • na_values=".". SWITCH uses full stops to indicate an unspecified value. We want Pandas to interpret full stops as NaN rather than as the string "." so that the column type is still inferred properly rather than being detected as a string.

Combining these parameters, here is an example of how to read a SWITCH file.

df = pd.read_csv("some_SWITCH_file.csv", index_col=False, dtype={"GENERATION_PROJECT": str}, na_values=".")
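If you read with dtype=str as described above, numeric columns need to be converted back before doing any maths on them. The sketch below is illustrative only; "gen_capacity_limit_mw" is just an example column name, not necessarily present in your file.

import pandas as pd

# Read everything as strings, then convert one column back to numeric.
df = pd.read_csv("some_SWITCH_file.csv", index_col=False, dtype=str, na_values=".")
df["gen_capacity_limit_mw"] = df["gen_capacity_limit_mw"].astype(float)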

Example

This example shows how we can use Pandas to generate a more useful view of our generation plants from the SWITCH input files.

import pandas as pd

# READ
# See note above on why we use these parameters
kwargs = dict(
  index_col=False,
  dtype={"GENERATION_PROJECT": str},
  na_values=".", 
)
gen_projects = pd.read_csv("generation_projects_info.csv", **kwargs)
costs = pd.read_csv("gen_build_costs.csv", **kwargs)
predetermined = pd.read_csv("gen_build_predetermined.csv", **kwargs)

# JOIN TABLES
gen_projects = gen_projects.merge(
  costs,
  on="GENERATION_PROJECT",
)

gen_projects = gen_projects.merge(
  predetermined,
  on=["GENERATION_PROJECT", "build_year"],
  how="left"  # Makes a left join
)

# FILTER
# When uncommented, this line will filter out all the projects that aren't wind.
# gen_projects = gen_projects[gen_projects["gen_energy_source"] == "Wind"]

# WRITE
gen_projects.to_csv("projects.csv", index=False)

If you run the code snippet above in the inputs folder, it will create a projects.csv file containing the project data, cost data and prebuild data all in one file.
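If you want a quick summary of the merged table, a snippet like the following could be appended to the end of the script above. It only uses columns already referenced in the example (GENERATION_PROJECT and gen_energy_source); adjust the names to match your own inputs.

# Count how many distinct projects use each energy source.
counts = (
    gen_projects.drop_duplicates("GENERATION_PROJECT")["gen_energy_source"]
    .value_counts()
)
print(counts)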