This is Part 1 of a two-part blog series (Part 2 can be found here).
Introduction
A few months ago, I decided it was time for me to consolidate my knowledge of Deep Learning. I had dabbled in different aspects of the field over the years, but had never gone through a more formal, rigorous process, and was looking to address that. In researching different options, the three that kept coming up were Ian Goodfellow’s Deep Learning book, Jeremy Howard’s fastai course, and Andrew Ng’s Deep Learning Specialization. However, I found a lack of consensus on the best starting point in the field. In this two-part blog series, I aim to fill that gap by sharing my insights on the course I eventually chose: Andrew Ng’s Deep Learning Specialization.
Motivation
Rather than reiterating the significance of Deep Learning (just open any news outlet, or have a look at the 76% increase in Microsoft’s market cap since the release of ChatGPT…), I will focus on my personal motivations for undertaking this course:
1. Deep Learning holds the “predictive” crown
In my experience as a Data Scientist, the crux of a problem often lies in data availability, agreeing on objectives, and so on. Recently, however, I’ve found it invaluable to take a step back and align with stakeholders on the core objective of a data science problem: are we seeking inference, or are we seeking prediction? Most problems fall into one of these two categories. The lines can occasionally blur, but agreeing on which one takes precedence is important. When it comes to “pure” prediction problems, where model interpretation is secondary, neural networks take the crown, provided data is plentiful. If that is the case, why limit oneself to interpretable statistical and machine learning models?
2. Deep Learning’s emergent properties
Emergent properties are “properties that become apparent and result from various interacting components within a system but are properties that do not belong to the individual components themselves”. Looking at a perceptron, or any other simple component of a neural network, in isolation, you’d never guess we could recognise faces, run neural style transfer, or talk to LLMs. The same can be said of the neurons in our brains. Deep Learning architectures, much like our brains, have demonstrated emergent properties time and time again. Academics argue about where this will end; Judea Pearl, for example, is a strong believer that causality is a separate beast. Like everyone else, I cannot offer a definitive answer. But in the meantime, Deep Learning has already shown its value, and shocked us a time or two, so why not give it a go?
3. Design flexibility
Compared to traditional statistical and ML techniques, neural networks offer near-limitless design possibilities. The world truly is your oyster when it comes to designing new architectures. In some ways, this can be daunting and challenging to work with. In others, it provides a real creative outlet for data scientists to experiment with. After all, many of the gold-standard techniques we use today came about from good old-fashioned empirical experimentation. Personally, I find this exciting - but maybe it’s just appealing to the pragmatism 4 years of Engineering school hammered into me…
Course Content
The Deep Learning Specialization is structured into five courses:
- Neural Networks and Deep Learning
- Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
- Structuring Machine Learning Projects
- Convolutional Neural Networks
- Sequence Models
The first three courses, which I will discuss in this post, lay the foundations of Deep Learning. The principles covered here apply to any neural network you might train, regardless of architecture type or data modality. The final two courses delve into two critical applications of Deep Learning: Convolutional Neural Networks & Computer Vision, and Sequence Models, which I will cover in Part 2. So, what do the first three courses entail?
Course 1
Course 1, Neural Networks and Deep Learning, begins by building neural networks up from simple machine learning algorithms like logistic regression. It then delves into the intuition behind these models, introduces gradient descent, and explains how it fits into a single forward/backward propagation learning pass. The iterative nature of training effective neural networks is emphasized from the outset.
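To make this concrete, here is a minimal NumPy sketch of the kind of forward/backward propagation pass the course has you implement by hand for logistic regression; the shapes, helper names, and learning rate are my own illustrative choices rather than the course’s exact notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def propagate(w, b, X, Y):
    """One forward/backward pass for logistic regression.

    X: (n_features, m) training examples, Y: (1, m) binary labels,
    w: (n_features, 1) weights, b: scalar bias.
    """
    m = X.shape[1]

    # Forward pass: predictions and cross-entropy cost
    A = sigmoid(w.T @ X + b)                                  # (1, m)
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

    # Backward pass: gradients of the cost w.r.t. the parameters
    dZ = A - Y                                                # (1, m)
    dw = (X @ dZ.T) / m                                       # (n_features, 1)
    db = np.mean(dZ)
    return dw, db, cost

# A single gradient descent step (the 0.01 learning rate is arbitrary):
# dw, db, cost = propagate(w, b, X, Y)
# w, b = w - 0.01 * dw, b - 0.01 * db
```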
Course 2
Course 2, Improving Deep Neural Networks: Hyperparameter Tuning, Regularization, and Optimization, starts with a comprehensive discussion of regularization techniques such as L1, L2 (weight decay), dropout, data augmentation, and early stopping. The pros and cons of minibatching are introduced, which segues nicely into optimisation techniques: gradient descent, exponentially weighted moving averages, and how the two combine to give gradient descent with momentum. RMSProp, Adam, and learning rate decay are also covered. The course takes the time to walk through the nuances associated with these techniques, such as exploding/vanishing gradients, and introduces gradient checking. More qualitative discussions are interspersed between the technical topics. For example, Ng introduces the unconventional concept of “orthogonalization”, encouraging us to view the different goals involved in building a neural network (such as minimising bias vs minimising variance) as orthogonal, and to treat them as such. Granted, in a Deep Learning model the trade-off between these two properties is less direct than in more traditional techniques, but to me at least, treating them as fully orthogonal seemed a little far-fetched.
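For a flavour of the optimisation material, here is a rough NumPy sketch of a single Adam update, tying together the momentum and RMSProp ideas mentioned above; the default hyperparameters are the commonly quoted ones, and the function itself is my own illustration, not code from the course:

```python
import numpy as np

def adam_step(w, dw, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient dw at iteration t (t >= 1).

    m and v start as zero arrays with the same shape as w.
    """
    m = beta1 * m + (1 - beta1) * dw       # momentum: moving average of gradients
    v = beta2 * v + (1 - beta2) * dw**2    # RMSProp: moving average of squared gradients
    m_hat = m / (1 - beta1**t)             # bias correction for early iterations
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```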
Course 3
Course 3, Structuring Machine Learning Projects, is where Ng shares practical insights from his decades of experience with applied Deep Learning projects. This course extends to any statistical/ML technique one might implement, covering topics like evaluation metrics, error analysis, mislabeled data, and data mismatch. It also introduces transfer learning, fine-tuning, multi-task learning, and end-to-end deep learning. Of the three, this is the course I would most recommend.
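To give a flavour of what transfer learning looks like in practice, below is a minimal Keras sketch (TensorFlow being the course’s framework of choice) that freezes a pretrained image backbone and trains a new classification head on top; the choice of MobileNetV2, the input size, and the hypothetical five-class task are illustrative, not something the course prescribes:

```python
import tensorflow as tf

# Hypothetical task: adapt an ImageNet-pretrained backbone to a new 5-class problem.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze the pretrained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation="softmax"),  # new task-specific head
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(new_task_dataset, epochs=5)   # train only the new head first,
# base.trainable = True                   # then optionally unfreeze and fine-tune
#                                         # the backbone with a much lower learning rate
```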
Reflections
Overall, my experience with the Deep Learning Specialization has been highly rewarding. Initially, I was hesitant due to the course’s choice of TensorFlow as its Deep Learning framework, a selection that seemed unconventional given the rising popularity of PyTorch (see below…).
However, my apprehensions were quickly allayed when I discovered that the majority of the work in the first three courses is entirely library-agnostic, carried out manually with vectors and matrices. Diving into this level of nitty-gritty detail, rather than hiding behind abstractions, is, in my experience, the best way to learn, and a key selling point of Ng’s course. The popular fastai course mentioned earlier, for example, takes a very different approach.
Course 3 offered insights that I will carry forward into any future data science work. I particularly enjoyed Ng’s recommendation to evaluate projects using a blend of “optimizing” and “satisficing” metrics. Often, when working on a project for stakeholders, there are lengthy and complex discussions about which model is “best”. How do we evaluate this? Numerous evaluation metrics are considered, each with varying degrees of importance, leading to debates about which model emerges as the “winner”. Ng’s pragmatic approach is to agree on a single metric to optimize and a list of metrics to “satisfice”. The latter might include prediction time, retraining cost, or the false negative rate: metrics that simply need to meet an agreed threshold. The former is then the single metric that, provided everything on the “satisfice” list is met, you want to optimize. I have since experimented with this framework and have seen fantastic results.
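In practice, the framework boils down to something as simple as the sketch below; the candidate models, metric names, and thresholds are entirely made up for illustration:

```python
# Hypothetical offline scores for three candidate models.
candidates = [
    {"name": "model_a", "f1": 0.91, "latency_ms": 120, "retrain_cost_usd": 400},
    {"name": "model_b", "f1": 0.93, "latency_ms": 310, "retrain_cost_usd": 250},
    {"name": "model_c", "f1": 0.89, "latency_ms": 80,  "retrain_cost_usd": 100},
]

# Satisficing metrics: hard thresholds that simply need to be met.
def satisfices(m):
    return m["latency_ms"] <= 200 and m["retrain_cost_usd"] <= 500

# Optimizing metric: among the models that satisfice, pick the best single number.
viable = [m for m in candidates if satisfices(m)]
winner = max(viable, key=lambda m: m["f1"])
print(winner["name"])  # -> model_a
```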
Lastly, as giants like Meta continue to release massive open-source models such as the 7B-parameter Llama, and most other companies lack the resources to train such large models from scratch, the key skill in the Deep Learning space will be the ability to adapt these models to one’s specific use case. The AI victors outside of large businesses like Microsoft and Meta will be those that can leverage these models effectively. Techniques like transfer learning and more niche methods, such as RAG for LLMs, come to mind. Ng covers these at length, and I believe their importance will only grow, given how the field is playing out. More on that in Part 2.