Introduction to Data Science: A Non-Technical Background
R track :: Goran S. Milovanović, PhD, DataKolektiv

PLEASE SEND YOUR NOTIFICATION OF INTEREST TO: goran.milovanovic@datakolektiv.com

COURSE WEBSITE/LEARNING MATERIAL
http://datakolektiv.org/app/introdsnontech

This course provides a comprehensive introduction to Data Science in the programming language R for those who want to get a grasp of contemporary data magic and its applications but have no technical background in coding, computer science, or statistics. It is a practical course that provides a minimal but clear and useful theoretical foundation for those who want to enter Data Science from a non-technical perspective and still be able to work efficiently in its highly technical context.

PREREQUISITES

You can use a computer and know how to search the Internet for information. This course is planned for, and ideally suited to:

• Those with a non-technical background who are interested in starting a career in IT/Digital: Product, HR, Marketing, Communications
• Non-tech employees in the IT industry (administration, marketing, HR, etc.)
• Students and scholars in social sciences, arts, and humanities
• Researchers with a background in qualitative methods

This is planned as your first course in Data Science, yet it still provides practical, working knowledge that you can rely on to build research and analytics projects on your own.

Most important of all: you are motivated to learn Data Science in R. If you want it, you want it – and you will work hard to win it. And if you are motivated and prepared to invest hard work, we guarantee success.

We will support you in anything that needs to be done during the course on any of the following operating systems: Windows, Linux Ubuntu/Debian, and macOS. If any of the projects that you would like to develop during the course rely on heavy computation and/or memory use, we will provide the technical infrastructure and teach you how to manage it (up to 64 GB of RAM, half a terabyte (500 GB) of disk space, and up to 24 AMD cores for computation).

GOALS

We want to help you learn how to learn. The landscape of contemporary Data Science is truly a universe of its own, and no one manages to know it all: it is a world meant for those with curiosity and courage, armed with readiness to explore, learn, unlearn, and relearn. We want to help you acquire the necessary coding skills together with a practical understanding of the fundamental concepts and principles. Only then will you be able to put your curiosity to work and start exploring and specializing in Data Science without needing anyone to guide you but yourself. Near the end of this course we expect you to already be able to develop and manage a Data Science project on your own.

OVERVIEW

We begin with simple guides on how to install the necessary (free and open source) software on your machine. You will be using RStudio, a powerful Integrated Development Environment (IDE) for R, a nice piece of software which will make your work as comfortable and efficient as possible. RStudio is free and easy to install and manage.

Then we will teach you, very patiently and one step at a time, to write computer programs in the programming language R, the lingua franca of contemporary Data Science. Even the very first programs that you will be writing encompass operations that are highly typical and important in Data Science.

In the following steps, we introduce the mathematical concepts of probability and of vector and matrix arithmetic, which are of essential importance in your future work. We provide a highly non-technical, conceptual introduction to the mathematical apparatus, with plenty of examples for you to study and understand. Each time a new concept is introduced, we study and exercise its representation and operation in the programming language R.

We provide an overview of where the data live in this world and how to get to them from R, an overview of all popular data formats and how to work with them, and focus a lot on data wrangling, a process in which Data Scientists clean and reshape their datasets in order to make them suitable for analytics, modeling, and visualization. In a very important chapter we learn how to work with strings and text in R and provide an introduction to regular expressions to process, clean, and correct textual information. We rely, of course, on a set of R packages jointly known as tidyverse: it is a powerful standard for managing data in R that Data Scientists are very happy to use.

In the next step you will start producing your first analytics: data overviews, summaries, aggregates, and visualizations. We start introducing the framework of statistical testing, and together we build your first numerical simulations. The simulations will help you build an essential understanding of probability theory and the behavior of random variables. We will then focus on producing industry-standard static visualizations in the R {ggplot2} package, carefully inspecting all important approaches to visual data representation and when to use them.

With the knowledge that you have gathered thus far, you will start building your first models, while being introduced to the essential concepts of learning error, optimization, cross-validation, regularization, and many more. Again, all mathematical concepts are introduced and explained in a highly non-technical manner, and we provide careful guidance while you build your understanding directly by applying each of them. We will go into the details of linear (simple and multiple) and logistic regression (binomial and multinomial) to learn how to deal with simple prediction and classification problems, and then study decision trees and random forests to be able to manage more difficult problems.

Finally, we will help you build your first Data Products: interactive reports with R Markdown that can be served as web pages to your future clients and collaborators, supported by dynamic, interactive visualizations with Plotly. We will also show how to build interactive maps and networks and serve them in your nicely designed reports. You will learn how to build a website in R Markdown in simple ways and serve as many reports from there as you need, so that you can start acting like an analytics firm of your own in no time at all.

COURSE CURRICULUM

The course takes 24 weeks (approximately six months) to complete. We expect you to be able to invest 6 – 8 hours of your time weekly. Each week we will organize a two- to three-hour lab where we meet to learn and discuss new things. Ad hoc 1:1 sessions will be organized. The rest of the expected work is guided exercise, Q&A, labs, and independent project development. All learning material, code, and exercises will be provided in a timely manner. If you choose to develop your own Data Science project in this course, you will be given support for working with the Git/GitHub version-control system.

Week 1
• Introduction to the RStudio IDE. First R programs. Installations: R, RStudio, R packages to work with. Learning how to organize your work and stay organized. Hello world. Working with elementary data structures in R.

Week 2
• Input/Output: Find data, load data, inspect data, and store data in R. Our first visualizations in {ggplot2}. Understanding data management with R and RStudio. Introduction to data formats (CSV and RDS files). What is a data.frame and how do we work with it? Lists, a lot of them. Strings.

Week 3
• More data formats and structures (simple things in JSON and XML) + introduction to control flow in R (loops, if … else, etc.). Why do we have different data formats and structures, and how do we put them to best use? What is data.table? More strings.

Week 4
• Serious programming in R begins: functions, code vectorization, control flow. Data cleaning: working with strings and text. Introduction to regular expressions (regex).

Week 5
• Serious programming in R begins: vector and matrix arithmetic. More elaborate work with strings and text; more on regular expressions (regex).

Week 6
• Introduction to Exploratory Data Analytics: serious data visualization in R + {ggplot2} begins. Introduction to Probability Theory in R. What is mathematical statistics for?

Week 7
• Introduction to Probability Theory in R: building numerical simulations of random variables and visualizing their results. A philosophical discussion of probability and risk.

Week 8
• Serious data wrangling with {dplyr} and {tidyr} begins. Understanding the tidy principles of data management.

Week 9
• Serious data wrangling with {dplyr} and {tidyr} continues with a lot of exercises. What Relational Databases (RDBS) are and how we connect to them. Prerequisites: installing MySQL on your local machine. A crash course in SQL: so similar to {dplyr}.

Week 10
• Serious data wrangling with {dplyr} and {tidyr} continues with a lot of exercises. Using {dplyr} to work with a database. Introduction to {data.table}: speed and power at your fingertips.

Week 11
• Mastering {data.table}: the essential operations on large datasets.

Week 12
• Introduction to Estimation Theory: statistics and parameters. Numerical simulations, visualizations. Understanding the logic of statistical modeling. Sampling. Introduction to correlation and Simple Linear Regression.

Week 13
• Introduction to Estimation Theory: statistics and parameters. Numerical simulations, visualizations. The logic of statistical modeling elaborated. Enter the Sim-Fit loop. Bias and Variance of a statistical estimate. Parametric bootstrap.

Week 14
• Introduction to Estimation Theory: statistics and parameters. Numerical simulations, visualizations. The logic of statistical modeling elaborated. Multiple Linear Regression.

Week 15
• The logic of statistical modeling completely explained: optimize the Simple Linear Regression model from scratch. Understanding why statistics = learning: the concept of error minimization.

Week 16
• Binary classification problems: enter Binomial Logistic Regression. How does it generalize the Linear Regression problem? Probability Theory: a Maximum Likelihood Estimate (MLE).

Week 17
• Multinomial Logistic Regression for classification problems. Cross-Validation and Regularization in regression problems.

Week 18
• Decision Trees and Random Forests: very complicated classification problems and very powerful solutions. Understanding Information Gain: elements of Information Theory.

Week 19
• Random Forests: a case study. Showcase: XGBoost in R as a Swiss Army knife for solving regression and classification problems. Learning material + directions for further work.

Week 20
• Running R code in parallel to speed up data processing and modeling. Run large R jobs in batch mode.

Week 21
• Introduction to RStudio Server and laying the foundations for working with remote Linux servers for project development. Reporting in R Markdown begins.

Week 22
• Reporting in R Markdown. Interactive visualizations with Plotly, VisNetwork, and Leaflet maps.

Week 23
• Reporting in R Markdown. Interactive visualizations with Plotly, VisNetwork, and Leaflet maps.

Week 24
• Building a simple analytics website with R Markdown. The future: dashboard development with Shiny (a demonstration + learning material).

LECTURER

Goran S. Milovanović, PhD, is a professional Data Scientist with extensive experience in fundamental and applied research, market analytics, information retrieval, reporting, visualization, and deployment of Data Science projects and systems. He has been engaged in teaching, and has independently taught, methodology, statistics, and R programming (as well as other statistical systems) since the early 90s, online as well as in a classic setting, in English and Serbian, in the academic, NGO, and for-profit sectors, and has mentored online R students for years. While his background is in social and cognitive sciences (psychology major and PhD), he has been a programmer since the 80s. Since 2017 he has been a Data Scientist for Wikidata, the world’s largest open knowledge base, using R as a part of a technological stack encompassing Python, Apache Hadoop and Spark, RDBS, XML, JSON, RDF, SPARQL, Shiny, R Markdown, Docker, and more, to maintain full-stack Data Science projects. Highly skilled in experimental and survey methods and mathematical models of human behavior, he has edited and co-authored several books on Internet Behavior and the development of Information Society. In his academic track he studied the problems of choice under risk and uncertainty, distributional semantics, category learning and categorization, reasoning, and causal induction, at the University of Belgrade and New York University. He is an (occasional) blogger at R-bloggers and a regular contributor to the European R Users Conference. He runs a Data Science consultancy firm, DataKolektiv, from Belgrade, Serbia.
