Pandas cheatsheet
Quick Reference for Pandas Functions in Python for Data Analytics and Machine learning projects.
Basic Pandas Commands for Data Loading, Exploration, and Manipulation
import pandas as pd #To import pandas library in notebook
df = pd.read_csv('filename.csv') or pd.read_csv(r’location/folder/filename.csv’) or pd.read_excel(’filename.xlsx’) #To Load data from a CSV file or Excel file into a DataFrame
df.head() or df.head(n) #Display the first 5 rows (by default) or size = n number of rows of the DataFrame
df.describe() #Generate descriptive statistics of the DataFrame
df['column_name'] or df.column_name OR df[’column_name][index_number] #Access a specific column OR access specific column’s row value
df[['col1', 'col2']] #Access multiple columns
df.loc[row_index] or df.loc[ : , 3] or df.loc[0, ‘column_name’] #Access a row by index or all “rows” and “4rth” column or 1st row and column_name column intersecting value
df.iloc[row_index] or df.iloc[[0,1,2],0] or df.iloc[-5:] #Access a row by integer position or 1st 2nd 3rd row and 1st column or last 5 rows with all columns
Choosing between loc
and iloc
¶
iloc
uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So0:10
will select entries0,...,9
.loc
, meanwhile, indexes inclusively. So0:10
will select entries0,...,10
. loc can index any stdlib type: strings
df.isnull().sum() #Check for missing values in the DataFrame
df.set_index(”Title”) #This is useful if you can come up with an index for the dataset which is better than the current one
df.dropna() #Remove rows with missing values
df.fillna(value) #Fill missing values with a specific value
df['column'] = df['column'].astype('int') # Convert column data type
df.groupby('column').mean() # Group by column and calculate the mean
df.sort_values('column') # Sort DataFrame by a specific column
df.merge(other_df, on='key') # Merge DataFrames on a key column
pd.concat([df1, df2]) # Concatenate DataFrames
df['date'] = pd.to_datetime(df['date']) # Convert column to datetime
df.resample('M').mean() # Resample time-series data to monthly frequency
df.shape #(number of rows x number of columns)
df.duplicated() and df.duplicated().sum()
df.drop_duplicates() #Drops/deletes duplicate rows
CONDITIONAL SELECTION
df.loc[df.Country == “Italy”] #df.Country == “Italy” This operation produced a Series of
True
/False
booleans based on thecountry
of each record. This result can then be used inside ofloc
to select the relevant data.df.loc[(df.Country == “Italy”) & (df.points≥90)] # and &, or |
CONDITIONAL SELECTORS
isin() #
isin
lets you select data whose value "is in" a list of values . For eg df.loc[df.Country.isin([”Italy”,”France”])] #select wines only from Italy or Franceisnull() ,notnull() #These methods let you highlight values which are (or are not) empty (
NaN
). For eg df.loc[df.price.notnull()]
ASSIGNING DATA
- df[”Country”] =”India” #Every value of column is changed to “India”
SUMMARY FUNCTIONS
df.country.describe() #This method generates a high-level summary of the attributes of the given column. for only NUMERICAL DATA
df.column_name.mean()
df.col_name.unique() #To see a list of unique values we can use the
unique()
functiondf.col_name.value_counts() #To see a list of unique values and how often they occur in the dataset, we can use the
value_counts()
method: