Back to Cheatsheets

pandas

Table of Contents

1. GETTING STARTED

2. CREATE

3. EXPORT

4. SELECT DATA

5. MODIFY

6. AGGREGATES

7. MULTIPLE DATAFRAMES

8. DATA CLEANING

9 REGULAR EXPRESSIONS

GETTING STARTED

Import

import pandas as pd

View

Creating CSV

column1,column2,column3
value1,value2,value3

CREATE

DataFrame - an object that stores data as rows and columns (like a spreadsheet or SQL table)

Using dictionary

df1 = pd.DataFrame({
'column_name1': ['value1', 'value2'],
'column_name2' : ['value3', 'value4']
})

Using lists

df2 = pd.DataFrame([
['row1_value1', 'row1_value2', 'row1_value3'],
['row2_value1', 'row2_value2', 'row2_value3'],
],
columns=['column1', 'column2', 'column3'
])

Using CSV

pd.read_csv('file_name.csv', 
                names=[column_names_when_not_given],
                index_col=0)

EXPORT

Write to CSV

df.to_csv('new_file_name.csv')

SELECT DATA

1 column

df[key]
or
df.column_name

Multiple columns

new_df = other_df[['col_name1', 'col_name2']]

1 row

df.iloc[index]

Multiple rows

df.iloc[start:end]

Rows with logic

df[(df.col_name1 >= num) & (df.col_name1 <= num2)]
df[df.col_name.isin(['value1', 'value2'])]

Reset Indices

df.reset_index(inplace=True, drop=True)

Null

MODIFY

Add Columns

df['col_name'] = ['value`', 'value2']

Rename Columns

df.columns = ['new_col1', 'new_col1']
df.rename(columns = {
    'old_col': 'new_col1'
},
inplace=True)

Column Operations

df['col_name'] = df.col_name.apply(___)
df['new_col'] = df.col_name.apply(lambda)

Row operations

df['new_col'] = df.apply(lambda, axis=1)

Pivot Table

df.pivot(columns='ColumnToPivot',
         index='ColumnToBeRows',
         values='ColumnToBeValues')

Sort

df.sort_values("column_name").reset_index(drop = True)

AGGREGATES

1 column

df.column_name.measurement()

groupby 1 column

df.groupby('column1').column2.measurement()

groupby for percentile

df.groupby('col1').col2
    .apply(lambda x: np.percentile(x, 75))
    .reset_index()

groupby multiple columns

df.groupby(['col1', 'col2']).col3.measurement().reset_index()

MULTIPLE DATAFRAMES

Inner Merge

pd.merge(df1, df2)

Merge on Specific Columns

pd.merge(df1, df2.rename(columns={'current_col': 'other_col_name'}))
pd.merge(df1, df2, 
    left_on='df1_col', 
    right_on='df2_col'
    suffixes=['_df1samecol', '_df2samecol])

Outer Join

pd.merge(df1, df2, how='outer')

Left Merge

pd.merge(df1, df2, how='left')

Right Merge

pd.merge(df1, df2, how='right')

Concatenate

pd.concat([df1, df2, df2, ...])

DATA CLEANING

Open files

import glob

files = glob.glob('file*.csv')

Combile files

df_list = []
for filename in files:
  data = pd.read_csv(filename)
  df_list.append(data)

df = pd.concat(df_list)

Reshape table

pd.melt(
    frame=df, 
    ids_vars='df_col_keep', 
    value_vars='df_col_to_var', 
    value_name='new_col_values', 
    var_name='new_col_vars')

Duplicates

Split by Index

df['new_col'] = df.col.str[indices]

Split by character

str_split = df.col_name.str.split('_')
df['new_col'] = str_split.str.get(0)

Types

Convert to Number

Remove regex

Missing Values

REGULAR EXPRESSIONS

Literals

Alternation

Character Sets

WildCards

### Ranges

Shorthand Character classes

Groupings

Fixed Quantifiers

Optional Quanitifies

Kleene

Anchors