About This Course:

Perhaps the most popular data science methodologies come from machine learning. What distinguishes machine learning from other computer-guided decision processes is that it builds prediction algorithms using data. Some of the most popular products that use machine learning include the handwriting readers implemented by the postal service, speech recognition, movie recommendation systems, and spam detectors.

In this course, part of our Professional Certificate Program in Data Science, you will learn popular machine learning algorithms, principal component analysis, and regularization by building a movie recommendation system.

You will learn about training data and how to use a set of data to discover potentially predictive relationships. As you build the movie recommendation system, you will learn how to train algorithms using training data so you can predict the outcome for future datasets. You will also learn about overtraining and techniques to avoid it, such as cross-validation. All of these skills are fundamental to machine learning.

What You’ll Learn:

  • The basics of machine learning
  • How to perform cross-validation to avoid overtraining
  • Several popular machine learning algorithms
  • How to build a recommendation system
  • What regularization is and why it is useful

Frequently Asked Questions:

Honor code statement
HarvardX requires individuals who enroll in its courses on edX to abide by the terms of the edX honor code. HarvardX will take appropriate corrective action in response to violations of the edX honor code, which may include dismissal from the HarvardX course; revocation of any certificates received for the HarvardX course; or other remedies as circumstances warrant. No refunds will be issued in the case of corrective action for such violations. Enrollees who are taking HarvardX courses as part of another program will also be governed by the academic policies of those programs.

Research statement
By registering as an online learner in our open online courses, you are also participating in research intended to enhance HarvardX’s instructional offerings as well as the quality of learning and related sciences worldwide. In the interest of research, you may be exposed to some variations in the course materials. HarvardX does not use learner data for any purpose beyond the University’s stated missions of education and research. For purposes of research, we may share information we collect from online learning activities, including Personally Identifiable Information, with researchers beyond Harvard. However, your Personally Identifiable Information will only be shared as permitted by applicable law, will be limited to what is necessary to perform the research, and will be subject to an agreement to protect the data. We may also share with the public or third parties aggregated information that does not personally identify you. Similarly, any research findings will be reported at the aggregate level and will not expose your personal identity.

Please read the edX Privacy Policy for more information regarding the processing, transmission, and use of data collected through the edX platform.

Nondiscrimination/anti-harassment statement
Harvard University and HarvardX are committed to maintaining a safe and healthy educational and work environment in which no member of the community is excluded from participation in, denied the benefits of, or subjected to discrimination or harassment in our program. All members of the HarvardX community are expected to abide by Harvard policies on nondiscrimination, including sexual harassment, and the edX Terms of Service. If you have any questions or concerns, please contact harvardx@harvard.edu and/or report your experience through the edX contact form.

Who can take this course?

Unfortunately, learners from one or more of the following countries or regions will not be able to register for this course: Iran, Cuba and the Crimea region of Ukraine. While edX has sought licenses from the U.S. Office of Foreign Assets Control (OFAC) to offer our courses to learners in these countries and regions, the licenses we have received are not broad enough to allow us to offer this course in all locations. EdX truly regrets that U.S. sanctions prevent us from offering all of our courses to everyone, no matter where they live.

Meet Your Instructor:

Ansaf Salleb-Aouissi

Ansaf is a Lecturer in Discipline in the Computer Science Department at the School of Engineering and Applied Science at Columbia University. She received her BS in Computer Science in 1996 from the University of Science and Technology (USTHB), Algeria. She earned her master’s and Ph.D. degrees in Computer Science from the University of Orleans (France) in 1999 and 2003, respectively.

About MIT Horizon:

MIT Horizon is an expansive content library built to help you explore emerging technologies. Easy-to-understand lessons guide you through the complexities of the latest technologies and simplify expert-level concepts. Designed for both technical and non-technical learners, the library offers bite-size content that can lead to maximum career outcomes.


What is a Bootcamp?

Our facilitated bootcamps focus on rapid skill acquisition by progressing you through a standard course on an accelerated schedule with peers who are committed to progressing on pace. Our bootcamps include:

  • Live kick-off event
  • Instructor-facilitated Q&A for expert feedback and coaching
  • Learner Success Support: welcome call, advising sessions, personalized pace reminders
  • 24/7 help desk

About This Course:

This bootcamp covers the basics of data visualization and exploratory data analysis. We will use three motivating examples and ggplot2, a data visualization package for the statistical programming language R. We will start with simple datasets and then graduate to case studies about world health, economics, and infectious disease trends in the United States.

We’ll also be looking at how mistakes, biases, systematic errors, and other unexpected problems often lead to data that should be handled with care.

The fact that it can be difficult or impossible to notice a mistake within a dataset makes data visualization particularly important. The growing availability of informative datasets and software tools has led to increased reliance on data visualizations across many areas. Data visualization provides a powerful way to communicate data-driven findings, motivate analyses, and detect flaws.

This course can be used towards completion of a Professional Certificate in Data Science.

What You Will Learn:

  • Data visualization principles
  • How to communicate data-driven findings
  • How to use ggplot2 to create custom plots
  • The weaknesses of several widely used plots and why you should avoid them

Meet Your Instructors:

Rafael Irizarry

Professor of Biostatistics at Harvard University
Rafael Irizarry is a Professor of Biostatistics at the Harvard T.H. Chan School of Public Health and a Professor of Biostatistics and Computational Biology at the Dana-Farber Cancer Institute. For the past 15 years, Dr. Irizarry’s research has focused on the analysis of genomics data. During this time, he has also taught several classes, all related to applied statistics. Dr. Irizarry is one of the founders of the Bioconductor Project, an open source and open development software project for the analysis of genomic data. His publications related to these topics have been highly cited and his software implementations widely downloaded.

Leonardo Palomera

Leonardo Palomera, through his professional and academic experiences, has become a specialist in data analytics for a variety of subjects including education, statistics, economics, and finance. Currently, Leonardo works as a Data Scientist with Pearson Advance to support our goal of personalizing support that leads to high course completion rates. Additionally, Leonardo has held teaching positions at the University of Southern California (USC), the University of California, Los Angeles (UCLA), the University of Colorado, Boulder, the University of Denver, and California State University, Long Beach (CSULB). His goal is to empower students to gain the knowledge and skills they need to conduct robust analytics on a host of real-world problems.

About This Course:

This bootcamp will introduce you to the basics of R programming. You can better retain R when you learn it to solve a specific problem, so you’ll use a real-world dataset about crime in the United States. You will learn the R skills needed to answer essential questions about differences in crime across the different states.

We’ll cover R’s functions and data types, then tackle how to operate on vectors and when to use advanced functions like sorting. You’ll learn how to apply general programming features like “if-else” and “for loop” commands, and how to wrangle, analyze, and visualize data.

We help you develop a skill set that includes R programming, data wrangling with dplyr, data visualization with ggplot2, file organization with UNIX/Linux, version control with git and GitHub, and reproducible document preparation with RStudio.

This course can be used towards completion of a Professional Certificate in Data Science.

What You Will Learn:

  • Basic R syntax
  • Foundational R programming concepts such as data types, vector arithmetic, and indexing
  • How to perform operations in R including sorting, data wrangling using dplyr, and making plots

Meet Your Instructors:

Rafael Irizarry

Professor of Biostatistics at Harvard University

Leonardo Palomera


About This Course:

In this course, part of our Professional Certificate Program in Data Science, you will learn valuable concepts in probability theory. The motivation for this course is the circumstances surrounding the financial crisis of 2007-2008. Part of what caused this financial crisis was that the risk of some securities sold by financial institutions was underestimated. To begin to understand this very complicated event, we need to understand the basics of probability.

We will introduce important concepts such as random variables, independence, Monte Carlo simulations, expected values, standard errors, and the Central Limit Theorem. These statistical concepts are fundamental to conducting statistical tests on data and to understanding whether a pattern in the data you are analyzing is likely due to an experimental method or to chance.

Probability theory is the mathematical foundation of statistical inference, which is indispensable for analyzing data affected by chance and thus essential for data scientists.

What You’ll Learn:

  • Important concepts in probability theory including random variables and independence
  • How to perform a Monte Carlo simulation
  • The meaning of expected values and standard errors and how to compute them in R

Meet Your Instructor:

Rafael Irizarry

Professor of Biostatistics at Harvard University

About This Course:

Statistical inference and modeling are indispensable for analyzing data affected by chance, and thus essential for data scientists. In this course, you will learn these key concepts through a motivating case study on election forecasting.

This course will show you how inference and modeling can be applied to develop the statistical approaches that make polls an effective tool, and we’ll show you how to do this using R. You will learn the concepts necessary to define estimates and margins of error, and how to use them to make reasonably accurate predictions and to estimate the precision of your forecast.

Once you learn this, you will be able to understand two concepts that are ubiquitous in data science: confidence intervals and p-values. Then, to understand statements about the probability of a candidate winning, you will learn about Bayesian modeling. Finally, at the end of the course, we will put it all together to recreate a simplified version of an election forecast model and apply it to the 2016 election.

What You’ll Learn:

  • The concepts necessary to define estimates and margins of error, including populations, parameters, estimates, and standard errors, in order to make predictions about data
  • How to use models to aggregate data from different sources
  • The very basics of Bayesian statistics and predictive modeling

Meet Your Instructor:

Rafael Irizarry

Professor of Biostatistics at Harvard University

About This Course:

A typical data analysis project may involve several parts, each including several data files and different scripts with code. Keeping all this organized can be challenging.

Part of our Professional Certificate Program in Data Science, this course explains how to use Unix/Linux as a tool for managing files and directories on your computer and how to keep the file system organized. You will be introduced to the version control system git, a powerful tool for keeping track of changes in your scripts and reports. We also introduce you to GitHub and demonstrate how you can use this service to keep your work in a repository that facilitates collaboration.

Finally, you will learn to write reports in R Markdown, which permits you to incorporate text and code into a document. We’ll put it all together using the powerful integrated development environment RStudio.

What You’ll Learn:

  • How to use Unix/Linux to manage your file system
  • How to perform version control with git
  • How to start a repository on GitHub
  • How to leverage the many useful features provided by RStudio

Meet Your Instructor:

Rafael Irizarry

Professor of Biostatistics at Harvard University

About This Course:

In this course, part of our Professional Certificate Program in Data Science, we cover several standard steps of the data wrangling process like importing data into R, tidying data, string processing, HTML parsing, working with dates and times, and text mining. Rarely are all these wrangling steps necessary in a single analysis, but a data scientist will likely face them all at some point.

Very rarely is data easily accessible in a data science project. It’s more likely for the data to be in a file, a database, or extracted from documents such as web pages, tweets, or PDFs. In these cases, the first step is to import the data into R and tidy the data, using the tidyverse package. The process of converting data from its raw form to the tidy form is called data wrangling.

This process is a critical step for any data scientist. Knowing how to wrangle and clean data will enable you to make critical insights that would otherwise be hidden.

What You’ll Learn:

  • Importing data into R from different file formats
  • Web scraping
  • How to tidy data using the tidyverse to better facilitate analysis
  • String processing with regular expressions (regex)
  • Wrangling data using dplyr
  • How to work with dates and times
  • Text mining

Meet Your Instructor:

Rafael Irizarry

Professor of Biostatistics at Harvard University


About This Course:

Perhaps the most popular data science methodologies come from machine learning. What distinguishes machine learning from other computer guided decision processes is that it builds prediction algorithms using data. Some of the most popular products that use machine learning include the handwriting readers implemented by the postal service, speech recognition, movie recommendation systems, and spam detectors.

In this course, part of our Professional Certificate Program in Data Science, you will learn popular machine learning algorithms, principal component analysis, and regularization by building a movie recommendation system.

You will learn about training data, and how to use a set of data to discover potentially predictive relationships. As you build the movie recommendation system, you will learn how to train algorithms using training data so you can predict the outcome for future datasets. You will also learn about overtraining and techniques to avoid it such as cross-validation. All of these skills are fundamental to machine learning.
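
To make the idea of cross-validation concrete, here is a minimal sketch in Python using only the standard library (the course itself uses R). It assumes synthetic ratings and a simple shrunken per-movie average with a penalty `lam`; the function names, the data, and the shrinkage formula are illustrative assumptions, not the course's own code.

```python
import random
from math import sqrt


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_movie_means(train, lam):
    """Overall mean plus a per-movie effect shrunk toward zero by lam."""
    mu = sum(r for _, r in train) / len(train)
    sums, counts = {}, {}
    for movie, rating in train:
        sums[movie] = sums.get(movie, 0.0) + (rating - mu)
        counts[movie] = counts.get(movie, 0) + 1
    effects = {m: sums[m] / (counts[m] + lam) for m in sums}
    return mu, effects


def predict(model, movie):
    mu, effects = model
    return mu + effects.get(movie, 0.0)  # unseen movies fall back to the mean


def rmse(model, data):
    return sqrt(sum((predict(model, m) - r) ** 2 for m, r in data) / len(data))


def cross_validate(data, lam, k=5, seed=0):
    """Average held-out RMSE over k folds for a given penalty."""
    scores = []
    for held_out in k_fold_indices(len(data), k, seed):
        held = set(held_out)
        train = [d for i, d in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        scores.append(rmse(fit_movie_means(train, lam), test))
    return sum(scores) / k


# Synthetic ratings: 20 hypothetical movies with 1 to 8 noisy ratings each.
rng = random.Random(1)
data = []
for movie in range(20):
    true_effect = rng.gauss(0, 0.5)
    for _ in range(rng.randint(1, 8)):
        data.append((movie, 3.5 + true_effect + rng.gauss(0, 1.0)))

for lam in (0.0, 1.0, 5.0):
    train_err = rmse(fit_movie_means(data, lam), data)
    print(f"lam={lam}: training RMSE={train_err:.3f}, "
          f"CV RMSE={cross_validate(data, lam):.3f}")
```

On the training data, the unpenalized fit (`lam=0`) can never have a higher error than a penalized one, so training error alone cannot be used to choose the penalty. Held-out error from cross-validation is what allows the candidate values to be compared fairly.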

What You’ll Learn:

  • The basics of machine learning
  • How to perform cross-validation to avoid overtraining
  • Several popular machine learning algorithms
  • How to build a recommendation system
  • What regularization is and why it is useful

Frequently Asked Questions:

Who can take this course?

Unfortunately, learners from one or more of the following countries or regions will not be able to register for this course: Iran, Cuba and the Crimea region of Ukraine. While edX has sought licenses from the U.S. Office of Foreign Assets Control (OFAC) to offer our courses to learners in these countries and regions, the licenses we have received are not broad enough to allow us to offer this course in all locations. EdX truly regrets that U.S. sanctions prevent us from offering all of our courses to everyone, no matter where they live.

About This Course:

To become an expert data scientist you need practice and experience. By completing this capstone project you will get an opportunity to apply the knowledge and skills in R data analysis that you have gained throughout the series. This final project will test your skills in data visualization, probability, inference and modeling, data wrangling, data organization, regression, and machine learning.

Unlike in the other courses of our Professional Certificate Program in Data Science, in this course you will receive much less guidance from the instructors. When you complete the project, you will have a data product to show off to potential employers or educational programs, a strong indicator of your expertise in the field of data science.

What You’ll Learn:

  • How to apply the knowledge base and skills learned throughout the series to a real-world problem
  • How to independently work on a data analysis project

What You’ll Learn:

  • Fundamental R programming skills
  • Statistical concepts such as probability, inference, and modeling, and how to apply them in practice
  • Experience with the tidyverse, including data visualization with ggplot2 and data wrangling with dplyr
  • Familiarity with essential tools for practicing data scientists such as Unix/Linux, git and GitHub, and RStudio
  • How to implement machine learning algorithms
  • In-depth knowledge of fundamental data science concepts through motivating real-world case studies

Program Class List

1. Data Science: R Basics
Build a foundation in R and learn how to wrangle, analyze, and visualize data.

2. Data Science: Visualization
Learn basic data visualization principles and how to apply them using ggplot2.

3. Data Science: Probability
Learn probability theory, essential for a data scientist, using a case study on the financial crisis of 2007-2008.

4. Data Science: Inference and Modeling
Learn inference and modeling, two of the most widely used statistical tools in data analysis.

5. Data Science: Productivity Tools
Keep your projects organized and produce reproducible reports using GitHub, git, Unix/Linux, and RStudio.

6. Data Science: Wrangling
Learn to process and convert raw data into formats needed for analysis.

7. Data Science: Linear Regression
Learn how to use R to implement linear regression, one of the most common statistical modeling approaches in data science.

8. Data Science: Machine Learning
Build a movie recommendation system and learn the science behind one of the most popular and successful data science techniques.

9. Data Science: Capstone
Show what you've learned from the Professional Certificate Program in Data Science.
