Week 1: Lecture Syllabus

Focus

Interactive lectures on introductory topics in data science will fill the mornings of the first week of the workshop. The first three lectures will be led by Isabell Konrad and will cover classification, regression, clustering and neural networks. Thursday’s lecture by Yinshan Zhao will cover hypothesis testing and experimental design, while Friday’s lecture by Michael Reid will showcase modern software tools for data scientists.

For the first-week schedule, please visit this page; for the first-week notes, please visit the GitHub repository for the workshop.

Syllabus

Linear and logistic regression.

We start with our first machine learning algorithm, linear regression. We cover the train-test split, normalization of the data, the problem of overfitting, and regularization. A short code example shows how to use these tools in practice (using scikit-learn). We then introduce classification in contrast to regression and discuss key metrics such as accuracy, precision and recall. Finally, we cover the logistic regression classifier and show a short code example.
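As a rough illustration of this workflow (a sketch, not the lecture's own notebook), the following uses scikit-learn on synthetic data. The data sets, the choice of Ridge as the regularized linear model, and parameter values such as alpha=1.0 and test_size=0.25 are assumptions made for the example.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# --- Regression: held-out test set, normalization, L2 regularization ---
# Synthetic data (an assumption for illustration only).
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_reg, y_reg, test_size=0.25, random_state=0
)

# The pipeline fits the scaler on the training data only and reuses
# those statistics on the test data, so no test information leaks in.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X_train, y_train)
r2_test = ridge.score(X_test, y_test)  # R^2 on data the model has not seen
print(f"Ridge regression, test R^2: {r2_test:.3f}")

# --- Classification: logistic regression and key metrics ---
X_clf, y_clf = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_clf, y_clf, test_size=0.25, random_state=0, stratify=y_clf
)

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(Xc_train, yc_train)
y_pred = clf.predict(Xc_test)

accuracy = accuracy_score(yc_test, y_pred)
precision = precision_score(yc_test, y_pred)
recall = recall_score(yc_test, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Comparing the score on the training data with the score on the test data is one simple way to spot overfitting: a large gap suggests the model has memorized the training set.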

Notes and Jupyter notebooks for this content are available on the GitHub repository.

Important classifiers.

This lecture covers the decision tree and k-nearest-neighbours classifiers, as well as ensemble methods such as bagging and random forests. A short code example demonstrates how to apply the classifiers (scikit-learn).
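To give a feel for how these classifiers are applied, here is a minimal sketch with scikit-learn that trains all four on the same split and compares test accuracy. The synthetic data set and the hyperparameters (n_neighbors=5, n_estimators=50, and so on) are assumptions chosen for illustration, not values taken from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (an assumption for illustration only).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    # k-NN is distance based, so the features are standardized first.
    "k-nearest neighbours": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    # Bagging: many trees, each fitted on a bootstrap sample of the training set.
    # The base model is passed positionally because its keyword name
    # differs between scikit-learn versions.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    # Random forest: bagged trees that also sample features at each split.
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the training set and record accuracy on the test set.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name:>22}: test accuracy {score:.3f}")
```

On many data sets the ensembles score higher than a single tree because averaging reduces variance, though this is not guaranteed for any particular split.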

Notes and Jupyter notebooks for this content are available on the GitHub repository.

Neural Networks.

This lecture covers neural networks with various add-ons, as well as feature engineering, and includes a short code example in TensorFlow.

Notes and Jupyter notebooks for this content are available on the GitHub repository.

Hypothesis testing and experimental design.

First, we will discuss the purpose of hypothesis testing and learn how to carry it out. Through examples, we will introduce basic concepts, testing procedures, common misinterpretations and the connection between hypothesis testing and estimation. Then we will discuss statistical considerations in study design, such as sampling, controls, randomization, sample size and power. Finally, we will look into multiple testing procedures.
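A testing procedure and a multiple-testing correction can be sketched in plain Python. The example below is an illustration only: it uses a two-sided permutation test for a difference in means and the Bonferroni correction, which are common choices but not necessarily the procedures taught in the lecture. The function names are hypothetical and the measurements are made-up numbers.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Under the null hypothesis the group labels are exchangeable, so we
    reshuffle the pooled data and count how often the shuffled difference
    is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where the p-value is at most alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Made-up measurements for a control group and a treated group.
control = [5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 5.0, 4.9]
treated = [5.6, 5.4, 5.7, 5.5, 5.8, 5.3, 5.6, 5.5]

p_value = permutation_test(control, treated)
print(f"permutation p-value: {p_value:.4f}")

# Three hypothetical p-values tested together at a family-wise level of 0.05.
decisions = bonferroni([0.001, 0.02, 0.04])
print(decisions)  # [True, False, False]
```

A small p-value here means that a difference this large would rarely arise from random relabelling alone. It is not the probability that the null hypothesis is true, which is one of the common misinterpretations.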

Modern Software Tools for Data Scientists.

This lecture will cover the “plumbing” of data science: the software used for data ingestion, cleaning, storage, and analysis. We will provide an overview of the tools currently used by enterprises ingesting hundreds of thousands of events per second and handling petabytes of data. Attendees will have a hands-on introduction to the machine learning tools used in industry, including distributed processing with AWS and Apache Spark. We will also look ahead to the new generation of native stream processing platforms used for real-time machine learning.

Slides from this lecture are available here.