Pandas cheatsheet reference guide

Basic Pandas Commands for Data Loading, Exploration, and Manipulation

import pandas as pd #To import pandas library in notebook
df = pd.read_csv('filename.csv') or pd.read_csv(r’location/folder/filename.csv’) or pd.read_excel(’filename.xlsx’) #To Load data from a CSV file or Excel file into a DataFrame
df.head() or df.head(n) #Display the first 5 rows (by default) or size = n number of rows of the DataFrame
df.info() #Get a concise summary of the DataFrame
df.describe() #Generate descriptive statistics of the DataFrame
df['column_name'] or df.column_name OR df[’column_name][index_number] #Access a specific column OR access specific column’s row value
df[['col1', 'col2']] #Access multiple columns
df.loc[row_index] or df.loc[ : , 3] or df.loc[0, ‘column_name’] #Access a row by index or all “rows” and “4rth” column or 1st row and column_name column intersecting value
df.iloc[row_index] or df.iloc[[0,1,2],0] or df.iloc[-5:] #Access a row by integer position or 1st 2nd 3rd row and 1st column or last 5 rows with all columns

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9.
loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10. loc can index any stdlib type: strings

df.isnull().sum() #Check for missing values in the DataFrame
df.set_index(”Title”) #This is useful if you can come up with an index for the dataset which is better than the current one
df.dropna() #Remove rows with missing values
df.fillna(value) #Fill missing values with a specific value
df['column'] = df['column'].astype('int') # Convert column data type
df.groupby('column').mean() # Group by column and calculate the mean
df.sort_values('column') # Sort DataFrame by a specific column
df.merge(other_df, on='key') # Merge DataFrames on a key column
pd.concat([df1, df2]) # Concatenate DataFrames
df['date'] = pd.to_datetime(df['date']) # Convert column to datetime
df.resample('M').mean() # Resample time-series data to monthly frequency
df.shape #(number of rows x number of columns)
df.duplicated() and df.duplicated().sum()
df.drop_duplicates() #Drops/deletes duplicate rows

df.loc[df.Country == “Italy”] #df.Country == “Italy” This operation produced a Series of True/False booleans based on the country of each record. This result can then be used inside of loc to select the relevant data.
df.loc[(df.Country == “Italy”) & (df.points≥90)] # and &, or |

isin() #isin lets you select data whose value "is in" a list of values . For eg df.loc[df.Country.isin([”Italy”,”France”])] #select wines only from Italy or France
isnull() ,notnull() #These methods let you highlight values which are (or are not) empty (NaN). For eg df.loc[df.price.notnull()]

df.country.describe() #This method generates a high-level summary of the attributes of the given column. for only NUMERICAL DATA
df.column_name.mean()
df.col_name.unique() #To see a list of unique values we can use the unique() function
df.col_name.value_counts() #To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method: