To train a model, we need to feed it data that has been cleaned and is appropriate for training. Appropriate data means only as much data as is required, with no duplicate values and no irrelevant data (rows, columns, or null values).
The role of a data analyst is to collect, clean, analyze, and visualize data; EDA comes in at the second step. A data scientist's work starts after visualization: building models that can be trained on the prepared data and insights, and then integrating the model into a website or application (or building a user interface) so others can interact with it.
The flow of operations is as follows (a short pandas/seaborn sketch follows the list):
Importing libraries and reading data
Data Understanding (head, tail, shape, columns, describe, info, dtypes)
Data preparation (cleaning, filtering columns by need, converting dtypes, renaming columns, dropping irrelevant rows and columns, re-indexing, dropping duplicate rows, etc.)
Feature understanding (univariate EDA applied to each feature: KDE plot, histogram, box plot)
Feature relationships (multivariate EDA across features: scatter plot, pair plot, correlation heatmap, groupby aggregations)
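A minimal sketch of this flow, assuming a CSV file and column names that are placeholders rather than any specific dataset:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Importing libraries and reading data
df = pd.read_csv("data.csv")  # placeholder file name

# 2. Data understanding
print(df.head(), df.tail(), df.shape, df.columns, sep="\n")
print(df.describe())
df.info()

# 3. Data preparation
df = df.drop(columns=["unused_col"])                  # drop an irrelevant column (placeholder name)
df = df.drop_duplicates().reset_index(drop=True)      # drop duplicate rows and re-index
df = df.rename(columns={"old_name": "new_name"})      # rename columns (placeholder names)
df["price"] = df["price"].astype(float)               # convert dtypes (placeholder column)

# 4. Feature understanding (univariate)
sns.histplot(df["price"], kde=True)                   # histogram with KDE overlay
plt.show()
sns.boxplot(x=df["price"])                            # box plot
plt.show()

# 5. Feature relationships (multivariate)
sns.scatterplot(data=df, x="price", y="rating")       # placeholder columns
sns.pairplot(df.select_dtypes("number"))
sns.heatmap(df.select_dtypes("number").corr(), annot=True)
plt.show()
print(df.groupby("category")["price"].mean())         # placeholder grouping column
```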
EDA types
Univariate non-graphical (one-variable)
Univariate graphical (stem-and-leaf plots, histograms, box plots)
Multivariate non-graphical (used when examining causes or relationships between variables; see the sketch after this list)
Multivariate graphical (grouped bar plot or bar chart, scatter plot, multivariate chart, run chart, bubble chart, heat map)
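A brief sketch of the non-graphical variants with pandas; the DataFrame and its column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "b", "a", "c", "b", "a"],
    "price":    [10, 12, 9, 20, 11, 10],
    "rating":   [4.0, 3.5, 4.2, 2.8, 3.9, 4.1],
})

# Univariate non-graphical: summary statistics and frequency counts for a single variable
print(df["price"].describe())
print(df["category"].value_counts())

# Multivariate non-graphical: relationships between two or more variables
print(df[["price", "rating"]].corr())                 # correlation table
print(pd.crosstab(df["category"], df["price"] > 10))  # cross-tabulation
```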
Languages used to do EDA
Python
R
Resources:
- EDA - https://hex.tech/templates/exploratory-analysis/exploratory-data-analysis/ (introductory only, does not go deep)
ML lifecycle
Machine Learning Lifecycle: Key Stages
The lifecycle of a machine learning project follows a structured workflow to transform raw data into actionable insights. Here’s a simplified breakdown:
Data Preparation
Goal: Curate and refine datasets for training.
Process: Source or build datasets, clean inconsistencies, handle missing values, and ensure data represents real-world scenarios. This phase often consumes the most time, as high-quality data is foundational to model success.
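As one illustration of these cleaning steps with pandas (the DataFrame, column names, and imputation choices are made up for this sketch):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 31, 47],
    "income": [50000, 62000, np.nan, np.nan, 88000],
    "city":   ["NY", "LA", None, None, "SF"],
})

df = df.drop_duplicates()                                 # remove exact duplicate rows
df["age"] = df["age"].fillna(df["age"].median())          # impute numeric gaps with the median
df["income"] = df["income"].fillna(df["income"].mean())   # impute with the mean
df = df.dropna(subset=["city"])                           # drop rows still missing a key field
print(df)
```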
Model Training & Testing
Goal: Develop a predictive model using algorithms.
Process: Train models on prepared data, validate their accuracy with unseen test data, and iteratively refine hyperparameters (e.g., learning rate, layers) to optimize performance.
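A minimal scikit-learn sketch of this train/validate loop; the dataset is synthetic and the model and hyperparameter values are illustrative, not prescribed by the text:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out unseen test data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with chosen hyperparameters, then check accuracy on the held-out set
model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```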
Model Optimization & Packaging
Goal: Prepare the model for integration.
Process: Containerize the model (e.g., using Docker) to ensure compatibility with deployment environments, and optimize it for scalability and latency.
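Containerization itself happens outside Python (e.g., via a Dockerfile), but a common first step is serializing the trained model so it can be copied into the image; a sketch using joblib, which is an assumption rather than tooling named in the text:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Serialize the trained model so it can be copied into a container image
joblib.dump(model, "model.joblib")

# Inside the container, the serving code reloads it
restored = joblib.load("model.joblib")
print(restored.predict(X[:5]))
```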
Validation Against Business Goals
Goal: Align model performance with real-world needs.
Process: Evaluate metrics (accuracy, precision, recall) and test in simulated business scenarios. A model may excel technically but fail to address practical requirements.
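For instance, the metrics named above can be computed with scikit-learn; the label arrays here are invented for illustration:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
```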
Iterative Refinement
Goal: Improve model robustness.
Process: Repeat training cycles—adjusting data, algorithms, or hyperparameters—until the model meets predefined success criteria.
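One way to automate part of this refinement loop is a cross-validated grid search over hyperparameters; a sketch with scikit-learn where the parameter grid is arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=42)

# Try several hyperparameter combinations with 5-fold cross-validation
param_grid = {"n_estimators": [100, 200], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("best params  :", search.best_params_)
print("best CV score:", search.best_score_)
```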
Deployment
Goal: Integrate the model into production systems.
Process: Deploy to environments like cloud platforms (AWS, Azure), edge devices (IoT sensors), or on-premises servers, depending on latency and scalability needs.
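As a sketch only: one common pattern is wrapping the model in a small HTTP service before deploying it to any of those environments. Flask and the /predict route below are assumptions, not something the text prescribes:

```python
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.joblib")  # assumes the packaged model from the earlier step

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [[0.1, 0.2, ...]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```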
Monitoring & Retraining
Goal: Maintain model relevance over time.
Process: Continuously track performance metrics (e.g., drift detection) and retrain the model with fresh data to adapt to evolving trends or inputs.
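A simple, hedged example of drift detection: compare a feature's training-time distribution against recent production data with a two-sample Kolmogorov-Smirnov test (the data and the 0.05 threshold are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # distribution at training time
live_feature  = rng.normal(loc=0.4, scale=1.0, size=1000)  # recent production data (shifted)

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.05:  # illustrative significance threshold
    print("Drift detected -> schedule retraining with fresh data")
else:
    print("No significant drift detected")
```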
Why It Matters
This iterative process ensures models remain accurate, reliable, and aligned with business objectives. Skipping steps like validation or monitoring often leads to "black box" models that underperform in real applications.