Data Science Fundamentals LiveLessons teaches you the foundational concepts, theory, and techniques you need to know to become an effective data scientist. The videos present you with applied, example-driven lessons in Python and its associated ecosystem of libraries, where you get your hands dirty with real datasets and see real results.

Description

If nothing else, by the end of this video course you will have analyzed a number of datasets from the wild, built a handful of applications, and applied machine learning algorithms in meaningful ways to get real results. And along the way you learn the best practices and computational techniques used by a professional data scientist. More specifically, you learn how to acquire data that is openly accessible on the Internet by working with APIs. You learn how to parse XML and JSON data to load it into a relational database.

About the Instructor

Jonathan Dinu is an author, researcher, and most importantly, an educator. He is currently pursuing a Ph.D. in Computer Science at Carnegie Mellon’s Human Computer Interaction Institute (HCII), where he is working to democratize machine learning and artificial intelligence through interpretable and interactive algorithms. Previously, he founded Zipfian Academy (an immersive data science training program acquired by Galvanize), has taught classes at the University of San Francisco, and has built a Data Visualization MOOC with Udacity. In addition to his professional data science experience, he has run data science trainings for a Fortune 500 company and taught workshops at Strata, PyData, and DataWeek (among others). He first discovered his love of all things data while studying Computer Science and Physics at UC Berkeley, and in a former life he worked for Alpine Data Labs developing distributed machine learning algorithms for predictive analytics on Hadoop.

Jonathan has always had a passion for sharing the things he has learned in the most creative ways he can. When he is not working with students, you can find him blogging about data, visualization, and education at hopelessoptimism.com or rambling on Twitter @jonathandinu.

Skill Level

  • Beginner

What You Will Learn

  • How to get up and running with a Python data science environment
  • The essentials of Python 3, including object-oriented programming
  • The basics of the data science process and what each step entails
  • How to build a simple (yet powerful) recommendation engine for Airbnb listings
  • Where to find quality data sources and how to work with APIs programmatically
  • Strategies for parsing JSON and XML into a structured form
  • The basics of relational databases and how to use an ORM to interface with them in Python
  • Best practices of data validation, including common data quality checks

Who Should Take This Course

  • Aspiring data scientists looking to break into the field and learn the essentials necessary
  • Journalists, consultants, analysts, or anyone else who works with data and is looking to take a programmatic approach to exploring data and conducting analyses
  • Quantitative researchers interested in applying theory to real projects and taking a computational approach to modeling
  • Software engineers interested in building intelligent applications driven by machine learning
  • Practicing data scientists already familiar with another programming environment looking to learn how to do data science with Python

Course Requirements

  • Basic understanding of programming
  • Familiarity with Python and statistics is a plus

Lesson Descriptions

Lesson 1: Introduction to Data Science with Python

Lesson 1 begins with a working definition of data science (as we use it in the course), gives a brief history of the field, and provides motivating examples of data science products and applications. This lesson covers how to get set up with a data science programming environment locally, as well as gives you a crash course in the Python programming language if you are unfamiliar with it or are coming from another language such as R. Finally, it ends with an overview of the concepts and tools that the rest of the lessons cover, to motivate and excite you about what’s to come!

Lesson 2: The Data Science Process—Building Your First Application

Lesson 2 introduces the data science process by walking through an end-to-end example of building your very first data science application, an AirBnB listing recommender.

You continue to learn how to work with and manipulate data in Python, without any external libraries yet, and leverage the power of the built-in Python standard library. The core application of this lesson covers the basics of building a recommendation engine and shows you how, with simple statistics and a little ingenuity, you can build a compelling recommender, given the right data. And finally, it ends with a formal treatment of the data science process and the individual steps it entails.
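
To make the idea concrete, here is a minimal sketch, using only the standard library, of a nearest-neighbor style recommender of the kind described above; the listing fields and values are invented for illustration and are not the course’s dataset.

```python
from math import sqrt

# Hypothetical Airbnb-style listings: (id, price, bedrooms, review_score).
# In practice you would normalize the fields so price does not dominate.
listings = [
    ("A", 80.0, 1, 4.6),
    ("B", 150.0, 3, 4.9),
    ("C", 95.0, 2, 4.7),
    ("D", 300.0, 4, 4.2),
]

def distance(a, b):
    """Euclidean distance over the numeric fields of two listings."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a[1:], b[1:])))

def recommend(liked_id, k=2):
    """Return the k listings most similar to the one the user liked."""
    liked = next(l for l in listings if l[0] == liked_id)
    others = [l for l in listings if l[0] != liked_id]
    return sorted(others, key=lambda l: distance(liked, l))[:k]

print(recommend("A"))  # listings closest to A on price, bedrooms, rating
```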

Lesson 3: Acquiring Data—Sources and Methods

Lesson 3 begins the treatment of each of the specific stages of the data science process, starting with the first: data acquisition. The lesson covers the basics of finding the appropriate data source for your problem and how to download the datasets you need once you have found them.

Starting with an overview of how the infrastructure behind the Internet works, you learn how to programmatically make HTTP requests in Python to access data through APIs, as well as the basics of two of the most common data formats: JSON and XML. The lesson ends by setting up the dataset we use for the rest of the course: Foursquare Venues.

Working with the Foursquare dataset, you learn how to interact with APIs and do some minor web scraping. You also learn how to find and acquire data from a variety of sources and keep track of its lineage all along the way. You learn to put yourself in the data science mindset and how to see the data (hidden in plain sight) that we interact with every day.
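
As a small taste of what programmatic data acquisition looks like, here is a hedged sketch using only the Python standard library. It calls httpbin.org (a public echo service) as a stand-in for a real venue API such as Foursquare, which would additionally require its own endpoint and authentication.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# httpbin.org/get simply echoes the query parameters back as JSON;
# it stands in here for a real data API such as Foursquare.
params = {"near": "San Francisco, CA", "query": "coffee", "limit": 5}
url = "https://httpbin.org/get?" + urlencode(params)

with urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

print(data["args"])  # the parameters the server saw, parsed from the JSON body
```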

Lesson 4: Adding Structure—Data Parsing and Storage

Lesson 4 picks up with the second stage of what traditionally is referred to as an extract, transform, and load (ETL) pipeline, adding structure through the transformation of raw data.

You see how to work with a variety of data formats, including XML and JSON, by parsing the data we have acquired to eventually load it into an environment better-suited to exploration and analysis: a relational database. But before we load our data into a database, we take a short diversion to talk about how to conceptually model structure in data with code. You get a primer in object-oriented programming and learn how to leverage it to create abstractions and data models that define how you can interface with your data.
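
For illustration, here is a minimal sketch of modeling structure with code: a raw JSON record is parsed into a small Python data model that fixes field names and types. The fields shown are hypothetical, not the course’s schema.

```python
import json
from dataclasses import dataclass

# A raw record as it might arrive from an API or file (fields are made up).
raw = '{"id": 42, "name": "Blue Bottle", "city": "Oakland", "checkins": "1204"}'

@dataclass
class Venue:
    """A light data model: one place to define types and cleaning rules."""
    id: int
    name: str
    city: str
    checkins: int

    @classmethod
    def from_json(cls, text):
        d = json.loads(text)
        # Coerce types here so downstream code can trust the model.
        return cls(id=int(d["id"]), name=d["name"], city=d["city"],
                   checkins=int(d["checkins"]))

venue = Venue.from_json(raw)
print(venue)
```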

Lesson 5: Storing Data: Relational Databases (with SQLite)

Lesson 5 starts with an introduction to one of the most ubiquitous data technologies—the relational database. The lesson serves as an end cap to the ETL pipeline of the previous videos. You learn the ins and outs of the various strategies for storing data and see how to map the abstractions you created in Python to database tables through the use of an object-relational mapper (ORM). The interface an ORM provides gives you the best of both worlds: you can query and manipulate data with Python while reliably persisting it in a database.
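
As a rough sketch of the ORM idea (assuming SQLAlchemy 1.4 or later and an in-memory SQLite database; the table and columns are illustrative, not the course’s actual schema):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Venue(Base):
    """A Python class mapped to a database table by the ORM."""
    __tablename__ = "venues"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    city = Column(String)

# SQLite in memory keeps the example self-contained.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Venue(name="Blue Bottle", city="Oakland"))
    session.commit()
    # Query with Python objects instead of writing SQL by hand.
    for venue in session.query(Venue).filter_by(city="Oakland"):
        print(venue.id, venue.name)
```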

Lesson 6: Data Validation and Exploration

Lesson 6 starts by showing you how to effectively query your data to understand what it contains, uncover any biases it might carry, and learn best practices for dealing with missing values. After you have validated the quality of the data, you use descriptive statistics to learn how your data is distributed, as well as the limits of point statistics (single-number estimates) and why it is often necessary to use visual techniques.
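
A few common data-quality checks of the kind this lesson covers, sketched with pandas on a small, made-up table:

```python
import pandas as pd

# Hypothetical data with the kinds of problems validation should surface.
df = pd.DataFrame({
    "price": [80, 150, None, 95, 10000],               # missing value and an outlier
    "bedrooms": [1, 3, 2, 2, 2],
    "city": ["Oakland", "Oakland", "SF", "sf", "SF"],  # inconsistent labels
})

print(df.isna().sum())            # how many missing values per column
print(df.describe())              # point statistics: mean, std, quartiles, ...
print(df["city"].value_counts())  # spot inconsistent categorical labels

# One simple strategy for missing values: fill with the column median.
df["price"] = df["price"].fillna(df["price"].median())
```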

About LiveLessons Video Training

The LiveLessons Video Training series publishes hundreds of hands-on, expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. This professional and personal technology video series features world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, IBM Press, Pearson IT Certification, Prentice Hall, Sams, and Que. Topics include: IT Certification, Programming, Web Development, Mobile Development, Home and Office Technologies, Business and Management, and more. View all LiveLessons on InformIT at: http://www.informit.com/livelessons

About MIT Horizon

MIT Horizon is an expansive content library built to help you explore emerging technologies. Through easy-to-understand lessons, you’ll be guided through the complexities of the latest technologies and expert-level concepts made simple. Designed for both technical and non-technical learners, the library offers bite-size content that can lead to maximum career outcomes.

For a limited time, gain access to the complete MIT Horizon library.

Register today for exclusive entry.

Program overview

Gain an interdisciplinary understanding of the fundamentals of analytics, including analysis methods, analytical tools such as R, Python, and SQL, and business applications.

Using common analytics software and tools, statistical and machine learning methods, and data-intensive computing and visualization techniques, learners will gain the experience necessary to integrate all of these parts for maximum impact.

Project experience is also included as part of the MicroMasters® program. Through these projects, learners will hone their skills with data collection, storage, analysis, and visualization tools, as well as gain instincts for how and when each tool should be used.

These projects provide hands-on experience with real-world business applications of analytics and a deeper understanding of how to apply analytics skills to make the biggest difference.

 

What you will learn

  • Use essential analytics tools like R, Python, SQL, and more.
  • Understand fundamental models and methods of analytics, and how and when to apply them.
  • Learn to build a data analysis pipeline, from collection and storage through analysis and interactive visualization.
  • Apply your new analytics skills in a business context to maximize your impact.

Program Class List

1. Computing for Data Analysis
A hands-on introduction to basic programming principles and practice relevant to modern data analysis, data mining, and machine learning.

2. Data Analytics for Business
This course prepares students to understand business analytics and become leaders in these areas in business organizations.

3. Introduction to Analytics Modeling
Learn essential analytics models and methods and how to appropriately apply them, using tools such as R, to retrieve desired insights.

Meet your instructors

Joel Sokol

Director of the Master of Science in Analytics program
He received his PhD in operations research from MIT and his bachelor’s degrees in mathematics, computer science, and applied sciences in engineering from Rutgers University. His primary research interests are in sports analytics and applied operations research. He has worked with teams or leagues in all three of the major American sports. Dr. Sokol's LRMC method for predictive modeling of the NCAA basketball tournament is an industry leader, and his non-sports research has won the EURO Management Science Strategic Innovation Prize. Dr. Sokol has also won recognition for his teaching and curriculum development from IIE and the NAE, and is the recipient of Georgia Tech's highest awards for teaching.

Richard W. Vuduc

Associate Professor of Computational Science and Engineering
Richard W. Vuduc is an Associate Professor of Computational Science and Engineering at the Georgia Institute of Technology. He received his Ph.D. in Computer Science from the University of California, Berkeley.

Sridhar Narasimhan

Professor at The Georgia Institute of Technology
Sridhar Narasimhan is Professor of IT Management and Co-Director of the Business Analytics Center (BAC) at the Scheller College of Business. The BAC partners with its Executive Council companies in the analytics space and supports Scheller’s BSBA, MBA, and MS Analytics programs. Professor Narasimhan has developed and taught the MBA IT Practicum course. Since 2016, he has been teaching Business Analytics to undergraduate and MBA students at Scheller. Professor Narasimhan is the founder and first Area Coordinator of the nationally ranked Information Technology Management area. In fall 2010, he was the Acting Dean and led the College in its successful AACSB Maintenance of Accreditation effort. He was Senior Associate Dean from 2007 through 2015.

Charles Turnitsa

Professor at The Georgia Institute of Technology
Dr. Charles "Chuck" Turnitsa has spent his career since the early 1990s performing information systems and modeling-based research and development, chiefly for the Department of Defense and NASA. He received his PhD from Old Dominion University in Modeling and Simulation (M&S), and has spent some years teaching a variety of topics in the field. Most recently, before coming to Georgia Tech, he spent two years leading the M&S Graduate Program at Columbus State University. Now he is serving as research faculty with Georgia Tech Research Institute, continuing research into various topics related to M&S, and continuing to teach graduate level and professional education level topics in information systems and M&S.

What you will learn

  • The history of data science, tangible illustrations of how data science and analytics are used in decision making across multiple sectors today, and expert opinion on what the future might hold
  • A practical understanding of the fundamental methods used by data scientists, including statistical thinking and conditional probability, machine learning and algorithms, and effective approaches for data visualization
  • The major components of the Internet of Things (IoT) and the potential of IoT to totally transform the way in which we live and work in the not-too-distant future
  • How data scientists are using natural language processing (NLP), audio and video processing to extract useful information from books, scientific articles, Twitter feeds, voice recordings, YouTube videos and much more

Program Class List

1. Statistical Thinking for Data Science and Analytics
Learn how statistics plays a central role in the data science approach.

2. Machine Learning for Data Science and Analytics
Learn the principles of machine learning and the importance of algorithms.

3. Enabling Technologies for Data Science and Analytics: The Internet of Things
Discover the relationship between Big Data and the Internet of Things (IoT).

Meet your instructors

Tian Zheng

About Me

Tian Zheng is Associate Professor of Statistics at Columbia University. She obtained her PhD from Columbia in 2002. Her research develops novel methods and improves existing ones for exploring and analyzing interesting patterns in complex data from different application domains. Her current projects are in the fields of statistical genetics, bioinformatics and computational biology, feature selection and classification for high-dimensional data, and network analysis. In particular, Dr. Zheng has been developing statistical and computational tools for high-dimensional data, searching for genetic interactions associated with complex human disorders, and quantifying social structure and studying hard-to-reach populations using survey questions, with more than 40 peer-reviewed publications in journals including JASA, AOAS, and PNAS. Her work was recognized with the 2008 Outstanding Statistical Application Award from the American Statistical Association, the Mitchell Prize from ISBA, and a Google research award. She is on the editorial boards of Statistical Analysis and Data Mining and Frontiers in Genetics. She was Associate Editor for JASA from 2007 to 2013.

Kathy McKeown

About Me

A leading scholar and researcher in the field of natural language processing, McKeown focuses her research on big data; her interests include text summarization, question answering, natural language generation, multimedia explanation, digital libraries, and multilingual applications. Her research group's Columbia Newsblaster, which has been live since 2001, is an online system that automatically tracks the day's news, and demonstrates the group's new technologies for multi-document summarization, clustering, and text categorization, among others. Currently, she leads a large research project involving prediction of technology emergence from a large collection of journal articles. McKeown joined Columbia in 1982, immediately after earning her Ph.D. from the University of Pennsylvania. In 1989, she became the first woman professor in the school to receive tenure, and later the first woman to serve as a department chair (1998-2003).

Ansaf Salleb-Aouissi

Ansaf is a Lecturer in Discipline in the Computer Science Department at the School of Engineering and Applied Science at Columbia University. She received her BS in Computer Science in 1996 from the University of Science and Technology (USTHB), Algeria. She earned her master's and Ph.D. degrees in Computer Science from the University of Orleans (France) in 1999 and 2003, respectively.

Cliff Stein

About Me

His research interests include the design and analysis of algorithms, combinatorial optimization, operations research, network algorithms, scheduling, algorithm engineering, and computational biology. Professor Stein has published many influential papers in the leading conferences and journals in his field, and has held a variety of editorial positions at journals including ACM Transactions on Algorithms, Mathematical Programming, Journal of Algorithms, SIAM Journal on Discrete Mathematics, and Operations Research Letters. His work has been supported by the National Science Foundation and the Sloan Foundation. He is the winner of several prestigious awards, including an NSF CAREER Award, an Alfred Sloan Research Fellowship, and the Karen Wetterhahn Award for Distinguished Creative or Scholarly Achievement. He is also the co-author of two textbooks. Introduction to Algorithms, with T. Cormen, C. Leiserson, and R. Rivest, is currently the best-selling textbook in algorithms, has sold over half a million copies, and has been translated into 15 languages. Discrete Math for Computer Scientists, with Ken Bogart and Scot Drysdale, is a newer textbook which covers discrete math at an undergraduate level.

David Blei

About Me

David Blei joined Columbia in Fall 2014 as a Professor of Computer Science and Statistics. His research involves probabilistic topic models, Bayesian nonparametric methods, and approximate posterior inference. He works on a variety of applications, including text, images, music, social networks, user behavior, and scientific data. Professor Blei earned his Bachelor's degree in Computer Science and Mathematics from Brown University (1997) and his PhD in Computer Science from the University of California, Berkeley (2004). Before arriving at Columbia, he was an Associate Professor of Computer Science at Princeton University. He has received several awards for his research, including a Sloan Fellowship (2010), Office of Naval Research Young Investigator Award (2011), Presidential Early Career Award for Scientists and Engineers (2011), and Blavatnik Faculty Award (2013).

Itsik Pe’er

About Me

Itsik Pe’er is an associate professor in the Department of Computer Science. His laboratory develops and applies computational methods for the analysis of high-throughput data in germline human genetics. Specifically, he has a strong interest in isolated populations such as Pacific Islanders and Ashkenazi Jews. The Pe’er Lab has developed methodology to identify hidden relatives — primarily in such isolated populations — that involves inferring their past demography, detecting associations between phenotypes and genetic segments co-inherited from the joint ancestors of hidden relatives, and establishing the exceptional utility of whole-genome sequencing in population genetics. With the arrival of high-throughput sequencing methods, Pe’er has focused on characterizing genetic variation that is unique to isolated populations, including the effects of such variation on phenotype.

Mihalis Yannakakis

About Me

He studied at the National Technical University of Athens (Diploma in Electrical Engineering, 1975), and at Princeton University (PhD in Computer Science, 1979). He worked at Bell Labs Research from 1978 until 2001, as Member of Technical Staff (1978-1991) and as Head of the Computing Principles Research Department (1991-2001). He was Director of Computing Principles Research at Avaya Labs (2001-2002), and Professor of Computer Science at Stanford University (2002-2003). He joined Columbia University in 2004. His research interests include design and analysis of algorithms, complexity theory, combinatorial optimization, game theory, databases, and modeling, verification and testing of reactive systems.

Peter Orbanz

About Me

Before coming to New York, he was a Research Fellow in the Machine Learning Group of Zoubin Ghahramani at the University of Cambridge, and previously a graduate student of Joachim M. Buhmann at ETH Zurich. His main research interests are the statistics of discrete objects and structures: permutations, graphs, partitions, and binary sequences. Most of his recent work concerns representation problems and latent variable algorithms in Bayesian nonparametrics. More generally, he is interested in all mathematical aspects of machine learning and artificial intelligence.

Fred Jiang

Assistant Professor in the Electrical Engineering Department at Columbia University
Fred received his B.Sc. (2004) and M.Sc. (2007) in Electrical Engineering and Computer Science, and his Ph.D. (2010) in Computer Science, all from UC Berkeley. Before joining SEAS, he was Senior Staff Researcher and Director of Analytics and IoT Research at Intel Labs China. Fred’s research interests include cyber physical systems and data analytics, smart and sustainable buildings, mobile and wearable systems, environmental monitoring and control, and connected health & fitness. His ACme building energy platform has been widely adopted by universities and industries, including Lawrence Berkeley National Laboratory, National Taiwan University, and several commercial companies. His project on wearable and mobile fitness, in collaboration with University of Virginia, was featured on New Scientist and the Economist magazine. His air-quality monitoring project has been featured on China Central Television and People’s Daily, and was successfully incubated into a startup. He is actively serving on several technical and organizing committees including ACM SenSys, ACM/IEEE IPSN, and ACM BuildSys. He was a National Science Foundation (NSF) Graduate Fellow and a Vodafone-US Foundation Fellow.

Julia Hirschberg

Percy K. and Vida LW Hudson Professor of Computer Science at Columbia University
Julia Hirschberg does research in prosody, spoken dialogue systems, and emotional and deceptive speech. She received her PhD in Computer Science from the University of Pennsylvania in 1985. She worked at Bell Laboratories and AT&T Labs Research from 1985-2003 as a Member of Technical Staff and as a Department Head, creating the Human-Computer Interface Research Department at Bell Labs and moving with it to AT&T Labs. She served as editor-in-chief of Computational Linguistics from 1993-2003 and as an editor-in-chief of Speech Communication from 2003-2006. She is on the Editorial Board of Speech Communication and of the Journal of Pragmatics. She was on the Executive Board of the Association for Computational Linguistics (ACL) from 1993-2003, has been on the Permanent Council of the International Conference on Spoken Language Processing (ICSLP) since 1996, and served on the board of the International Speech Communication Association (ISCA) from 1999-2007 (as President 2005-2007). She is currently the chair of the ISCA Distinguished Lecturers selection committee. She is on the IEEE SLTC, the executive board of the North American chapter of the Association for Computational Linguistics, the CRA Board of Directors, and the board of the CRA-W. She has been active in working for diversity at AT&T and at Columbia. She has been a fellow of the American Association for Artificial Intelligence since 1994, an ISCA Fellow since 2008, and became an ACL Fellow in the founding group in 2012. She received a Columbia Engineering School Alumni Association (CESAA) Distinguished Faculty Teaching Award in 2009, received an honorary doctorate (hedersdoktor) from KTH in 2007, is the 2011 recipient of the IEEE James L. Flanagan Speech and Audio Processing Award, and also received the ISCA Medal for Scientific Achievement in the same year.

Michael Collins

Vikram S. Pandit Professor of Computer Science at Columbia University
Michael J. Collins is a researcher in the field of computational linguistics. His research interests are in natural language processing as well as machine learning and he has made important contributions in statistical parsing and in statistical machine learning. One notable contribution is a state-of-the-art parser for the Penn Wall Street Journal corpus. His research covers a wide range of topics such as parse re-ranking, tree kernels, semi-supervised learning, machine translation and exponentiated gradient algorithms with a general focus on discriminative models and structured prediction.

Shih-Fu Chang

Richard Dicker Chair Professor at Columbia University
Shih-Fu Chang’s research interest is focused on multimedia retrieval, computer vision, signal processing, and machine learning. He and his students have developed some of the earliest image/video search engines, such as VisualSEEk, VideoQ, and WebSEEk, contributing to the foundation of the vibrant field of content-based visual search and commercial systems for Web image search. Recognized by many best paper awards and high citation impacts, his scholarly work set trends in several important areas, such as compressed-domain video manipulation, video structure parsing, image authentication, large-scale indexing, and video content analysis. His group demonstrated the best performance in video annotation (2008) and multimedia event detection (2010) in the international video retrieval evaluation forum TRECVID. The video concept classifier library, ontology, and annotated video corpora released by his group have been used by more than 100 groups. He co-led the ADVENT university-industry research consortium with the participation of more than 25 industry sponsors. He has received IEEE Signal Processing Society Technical Achievement Award, ACM SIGMM Technical Achievement Award, IEEE Kiyo Tomiyasu award, IBM Faculty award, and Service Recognition Awards from IEEE and ACM. He served as the general co-chair of ACM Multimedia conference in 2000 and 2010, Editor-in-Chief of the IEEE Signal Processing Magazine (2006-8), Chairman of Columbia Electrical Engineering Department (2007-2010), Senior Vice Dean of Columbia Engineering School (2012-date), and advisor for several companies and research institutes. His research has been broadly supported by government agencies as well as many industry sponsors. He is a Fellow of IEEE and the American Association for the Advancement of Science.

Zoran Kostic

About Me

Zoran Kostic completed his Ph.D. in Electrical Engineering at the University of Rochester and his Dipl. Ing. degree at the University of Novi Sad. He spent most of his career in industry where he worked in research, product development and in leadership positions. Zoran's expertise spans mobile data systems, wireless communications, signal processing, multimedia, system-on-chip development and applications of parallel computing. His work comprises a mix of research, system architecture and software/hardware development, which resulted in a notable publication record, three dozen patents, and critical contributions to successful products. He has experience in Intellectual Property consulting. Dr. Kostic is an active member of the IEEE, and he has served as an associate editor of the IEEE Transactions on Communications and IEEE Communications Letters.

Andrew Gelman

Andrew Gelman is a professor of statistics and political science and director of the Applied Statistics Center at Columbia University. He has received the Outstanding Statistical Application award from the American Statistical Association, the award for best article published in the American Political Science Review, and the Council of Presidents of Statistical Societies award for outstanding contributions by a person under the age of 40. Andrew has done research on a wide range of topics, including: why it is rational to vote; why campaign polls are so variable when elections are so predictable; why redistricting is good for democracy; reversals of death sentences; police stops in New York City, the statistical challenges of estimating small effects; the probability that your vote will be decisive; seats and votes in Congress; social network structure; arsenic in Bangladesh; radon in your basement; toxicology; medical imaging; and methods in surveys, experimental design, statistical inference, computation, and graphics.

David Madigan

David Madigan received a bachelor’s degree in Mathematical Sciences and a Ph.D. in Statistics, both from Trinity College Dublin. He has previously worked for AT&T Inc., Soliloquy Inc., the University of Washington, Rutgers University, and SkillSoft, Inc. He has over 100 publications in such areas as Bayesian statistics, text mining, Monte Carlo methods, pharmacovigilance and probabilistic graphical models. He is an elected Fellow of the American Statistical Association and of the Institute of Mathematical Statistics. He recently completed a term as Editor-in-Chief of Statistical Science.

Lauren Hannah

Lauren Hannah is an Assistant Professor in the Department of Statistics at Columbia University. Dr. Hannah received a Ph.D. in Operations Research and Financial Engineering from Princeton University, and an A.B. in Classics, again from Princeton University. After completing her Ph.D., Dr. Hannah completed a postdoc at Duke in the Statistical Science Department. Her interests include machine learning, Bayesian statistics, and energy applications.

Eva Ascarza

Eva Ascarza is an Assistant Professor of Marketing at Columbia Business School. She is a marketing modeler who uses tools from statistics and economics to answer marketing questions. Her main research areas are customer analytics and pricing in the context of subscription businesses. She specializes in understanding and predicting changes in customer behavior, such as customer retention and usage. Another stream of her research focuses on developing statistical methodologies to be used by marketing practitioners. She received her PhD from London Business School (UK) and a MS in Economics and Finance from Universidad de Navarra (Spain).

James Curley

About Me

Dr. Curley has very broad interests in behavioral development. He has conducted and published research at molecular, systems, organismal and evolutionary levels of analysis in both animals and humans. The focus of Dr. Curley’s lab at Columbia is on the development of social behavior. Dr. Curley is interested in how both inherited genetic variability and social experiences during development can shift individual differences in various aspects of social behavior and what the neuroendocrinological basis of these differences may be. He also researches the reliability and validity of social behavioral tests conducted in the laboratory and whether it is possible to utilize alternative statistical and methodological approaches to more appropriately assess social behavior. Dr Curley believes that it is critical to understand how the 'social brains' of humans and other animals have been differentially shaped by evolution and to acknowledge how this should better inform translational research.

Create an end-to-end data analysis workflow in Python using the Jupyter Notebook and learn about the diverse and abundant tools available within the Project Jupyter ecosystem.

Overview

The Jupyter Notebook is a popular tool for learning and performing data science in Python (and other languages used in data science). This video tutorial will teach you about Project Jupyter and the Jupyter ecosystem and get you up and running in the Jupyter Notebook environment. Together, we’ll build a data product in Python, and you’ll learn how to share this analysis in multiple formats, including presentation slides, web documents, and hosted platforms (great for colleagues who do not have Jupyter installed on their machines). In addition to learning and doing Python in Jupyter, you will also learn how to install and use other programming languages, such as R and Julia, in your Jupyter Notebook analysis.

Learn How To

  • Create a start-to-finish Jupyter Notebook workflow: from installing Jupyter to creating your data analysis and ultimately sharing your results
  • Use additional tools within the Jupyter ecosystem that facilitate collaboration and sharing
  • Incorporate other programming languages (such as R) in Jupyter Notebook analyses

Who Should Take This Course

  • Users new to Jupyter Notebooks who want to use the full range of tools within the Jupyter ecosystem
  • Data practitioners who want a repeatable process for conducting, sharing, and presenting data science projects
  • Data practitioners who want to share data science analyses with friends and colleagues who do not use or do not have access to a Jupyter installation

Course Requirements

  • Basic knowledge of Python.
  • Download and install the Anaconda distribution of Python. You can install either version 2.7 or 3.x, whichever you prefer.
  • Create a GitHub account (strongly recommended but not required).
  • If you are unable to install software on your computer, you can access a hosted version via the Project Jupyter website (click on “try it in your browser”) or through Microsoft’s Azure Notebooks.

About Pearson Video Training

Pearson publishes expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. These professional and personal technology videos feature world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, Pearson IT Certification, Prentice Hall, Sams, and Que. Topics include: IT Certification, Network Security, Cisco Technology, Programming, Web Development, Mobile Development, and more.

Meet your instructor

Jamie Whitacre

Jamie was the technical project manager for Project Jupyter. She collaborated with Jupyter’s developers and open source community at large to define development strategy, advance feature work, and build community involvement. Jamie has more than 10 years of experience in scientific computing systems, informatics, and data analysis. Integrating research data and systems, streamlining data workflows, cleaning data, and educating users about data tools and workflows are her specialties. Jamie previously worked at the Smithsonian’s National Museum of Natural History designing and developing data pipelines in support of the Global Genome Initiative. She has experience working in academia, government, and industry positions. She earned her graduate degree in Geography from the University of Maryland and her undergraduate degree in Biology from Whitman College.

What you will learn

  • Fundamental R programming skills
  • Statistical concepts such as probability, inference, and modeling and how to apply them in practice
  • Gain experience with the tidyverse, including data visualization with ggplot2 and data wrangling with dplyr
  • Become familiar with essential tools for practicing data scientists such as Unix/Linux, git and GitHub, and RStudio
  • Implement machine learning algorithms
  • In-depth knowledge of fundamental data science concepts through motivating real-world case studies

Program Class List

1. Data Science: R Basics
Build a foundation in R and learn how to wrangle, analyze, and visualize data.

2. Data Science: Visualization
Learn basic data visualization principles and how to apply them using ggplot2.

3. Data Science: Probability
Learn probability theory, essential for a data scientist, using a case study on the financial crisis of 2007-2008.

4. Data Science: Inference and Modeling
Learn inference and modeling, two of the most widely used statistical tools in data analysis.

5. Data Science: Productivity Tools
Keep your projects organized and produce reproducible reports using GitHub, git, Unix/Linux, and RStudio.

6. Data Science: Wrangling
Learn to process and convert raw data into formats needed for analysis.

7. Data Science: Linear Regression
Learn how to use R to implement linear regression, one of the most common statistical modeling approaches in data science.

8. Data Science: Machine Learning
Build a movie recommendation system and learn the science behind one of the most popular and successful data science techniques.

9. Data Science: Capstone
Show what you've learned from the Professional Certificate Program in Data Science.

Meet your instructor

Rafael Irizarry

Professor of Biostatistics at Harvard University
Rafael Irizarry is a Professor of Biostatistics at the Harvard T.H. Chan School of Public Health and a Professor of Biostatistics and Computational Biology at the Dana Farber Cancer Institute. For the past 15 years, Dr. Irizarry’s research has focused on the analysis of genomics data. During this time, he has also taught several classes, all related to applied statistics. Dr. Irizarry is one of the founders of the Bioconductor Project, an open source and open development software project for the analysis of genomic data. His publications related to these topics have been highly cited and his software implementations widely downloaded.

Data Science Fundamentals Part II teaches you the foundational concepts, theory, and techniques you need to know to become an effective data scientist. The videos present you with applied, example-driven lessons in Python and its associated ecosystem of libraries, where you get your hands dirty with real datasets and see real results.

Description

If nothing else, by the end of this video course you will have analyzed a number of datasets from the wild, built a handful of applications, and applied machine learning algorithms in meaningful ways to get real results. And all along the way you learn the best practices and computational techniques used by professional data scientists. You get hands-on experience with the PyData ecosystem by manipulating and modeling data. You explore and transform data with the pandas library, perform statistical analysis with SciPy and NumPy, build regression models with statsmodels, and train machine learning algorithms with scikit-learn. All throughout the course you learn to test your assumptions and models by engaging in rigorous validation. Finally, you learn how to share your results through effective data visualization.

Code: https://github.com/hopelessoptimism/data-science-fundamentals
Resources: http://hopelessoptimism.com/data-science-fundamentals
Forum: https://gitter.im/data-science-fundamentals
Data: http://insideairbnb.com/get-the-data.html

About the Instructor
Jonathan Dinu is an author, researcher, and, most importantly, an educator. He is currently pursuing a Ph.D. in Computer Science at Carnegie Mellon’s Human Computer Interaction Institute (HCII), where he is working to democratize machine learning and artificial intelligence through interpretable and interactive algorithms. Previously, he founded Zipfian Academy (an immersive data science training program acquired by Galvanize), has taught classes at the University of San Francisco, and has built a Data Visualization MOOC with Udacity. In addition to his professional data science experience, he has run data science trainings for a Fortune 500 company and taught workshops at Strata, PyData, and DataWeek (among others). He first discovered his love of all things data while studying Computer Science and Physics at UC Berkeley, and in a former life he worked for Alpine Data Labs developing distributed machine learning algorithms for predictive analytics on Hadoop.

Jonathan has always had a passion for sharing the things he has learned in the most creative ways he can. When he is not working with students, you can find him blogging about data, visualization, and education at hopelessoptimism.com or rambling on Twitter @jonathandinu.

Skill Level

  • Beginner

What You Will Learn

  • How to get up and running with a Python data science environment
  • The basics of the data science process and what each step entails
  • How (and why) to perform exploratory data analysis in Python with the pandas library
  • The theory of statistical estimation to make inferences from your data and test hypotheses
  • The fundamentals of probability and how to use scipy to work with distributions in Python
  • How to build and evaluate machine learning models with scikit-learn
  • The basics of data visualization and how to communicate your results effectively
  • The importance of creating reproducible analyses and how to share them effectively

Who Should Take This Course

  • Aspiring data scientists looking to break into the field and learn the essentials necessary.
  • Journalists, consultants, analysts, or anyone else who works with data looking to take a programmatic approach to exploring data and conducting analyses.
  • Quantitative researchers interested in applying theory to real projects and taking a computational approach to modeling.
  • Software engineers interested in building intelligent applications driven by machine learning.
  • Practicing data scientists already familiar with another programming environment looking to learn how to do data science with Python.

Course Requirements

  • Basic understanding of programming.
  • Familiarity with Python and statistics is a plus.

Lesson 7: Exploring Data—Analysis and Visualization

Lesson 7 starts with a short historical diversion on the process and evolution of exploratory data analysis, to help you understand the context behind it. John Tukey, the godfather of EDA, said in The Future of Data Analysis that “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

Next you use matplotlib and seaborn, two Python visualization libraries, to learn how to visually explore a single dimension with histograms and boxplots. But a single dimension can only get you so far. By using scatterplots and other charts for higher-dimensional visualization, you see how to compare columns of your data and look for relationships between them.
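
For a sense of what those plots look like in code, here is a minimal sketch using matplotlib and seaborn (0.11 or later) on simulated data:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
price = rng.lognormal(mean=4.5, sigma=0.5, size=500)  # made-up listing prices
bedrooms = rng.integers(1, 5, size=500)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
sns.histplot(x=price, ax=axes[0])                 # one dimension: distribution
sns.boxplot(x=bedrooms, y=price, ax=axes[1])      # distribution by group
sns.scatterplot(x=bedrooms, y=price, ax=axes[2])  # two dimensions: relationship
plt.tight_layout()
plt.show()
```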

The lesson finishes with a cautionary tale of when statistics lie by exploring the impact of mixed effects and Simpson’s paradox.

Lesson 8: Making Inferences—Statistical Estimation and Evaluation

In Lesson 8 we lay the groundwork for the methods and theory we need to make inferences from data, starting with an overview of the various approaches and techniques that are part of the rich history of statistical analysis.

Next you see how to leverage computational- and sampling-based approaches to make inferences from your data. After learning the basics of hypothesis testing, one of the most used techniques in the data scientist’s tool belt, you see how to use it to optimize a web application with A/B testing. All along the way you learn to appreciate the importance of uncertainty and see how to bound your reasoning with confidence intervals.
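
To ground those ideas, here is a hedged sketch of a sampling-based analysis of an imaginary A/B test: a bootstrap confidence interval for the lift, followed by a classical two-sample test. The conversion data are simulated, not taken from the course.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated A/B test: 1 = converted, 0 = did not.
a = rng.binomial(1, 0.10, size=2000)  # control
b = rng.binomial(1, 0.12, size=2000)  # variant

# Bootstrap confidence interval for the difference in conversion rates.
diffs = [
    rng.choice(b, size=b.size).mean() - rng.choice(a, size=a.size).mean()
    for _ in range(5000)
]
low, high = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for lift: [{low:.4f}, {high:.4f}]")

# A classical two-sample test of the same question.
t_stat, p_value = stats.ttest_ind(b, a)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```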

And finally, the lesson finishes by discussing the age-old question of correlation versus causation, why it matters, and how to account for it in your analyses.

Lesson 9: Statistical Modeling and Machine Learning

In Lesson 9 you learn how to leverage statistical modeling to build a powerful model to predict Airbnb listing prices and infer which listings are undervalued. It starts with a primer on probability and statistical distributions using SciPy and NumPy, including how to estimate parameters and fit distributions to data.
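
As a sketch of what fitting a distribution with SciPy can look like (on simulated prices, not the course’s Airbnb data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
prices = rng.lognormal(mean=4.5, sigma=0.4, size=1000)  # simulated prices

# Fit a log-normal distribution by maximum likelihood and inspect parameters.
shape, loc, scale = stats.lognorm.fit(prices, floc=0)
print(f"estimated sigma = {shape:.3f}, scale = {scale:.1f}")

# Use the fitted distribution: what fraction of listings exceed $200?
fitted = stats.lognorm(shape, loc=loc, scale=scale)
print(f"P(price > 200) is about {fitted.sf(200):.3f}")
```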

Next you learn about the theory of regression through a hands-on application with the Airbnb data and see how to model correlations in your data. By solving for the line of best fit and learning how to interpret its coefficients, you can make inferences about your data.

But building a model is only one side of the coin; if you cannot effectively evaluate how well it performs, it might as well be useless. Next you learn how to evaluate a regression model, what can go wrong when fitting one, and how to overcome those challenges.

The lesson finishes by talking about the differences between and nuances of statistics, modeling, and machine learning. It provides an overview of the various types of models and algorithms used for machine learning and introduces scikit-learn, a robust machine learning library in Python, for making predictions.
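
The regression-and-evaluation workflow might look roughly like the following sketch: fit a line with statsmodels, read its coefficients, then fit and score the same model on held-out data with scikit-learn. The columns and data are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
bedrooms = rng.integers(1, 5, size=300).astype(float)
price = 50 + 40 * bedrooms + rng.normal(0, 15, size=300)  # invented relationship

# statsmodels: inspect coefficients and fit statistics.
X = sm.add_constant(bedrooms)  # adds the intercept column
ols = sm.OLS(price, X).fit()
print(ols.params)              # intercept and slope

# scikit-learn: same model, evaluated on held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    bedrooms.reshape(-1, 1), price, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```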

About LiveLessons Video Training

The LiveLessons Video Training series publishes hundreds of hands-on, expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. This professional and personal technology video series features world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, IBM Press, Pearson IT Certification, Prentice Hall, Sams, and Que. Topics include: IT Certification, Programming, Web Development, Mobile Development, Home and Office Technologies, Business and Management, and more. View all LiveLessons on InformIT at: http://www.informit.com/livelessons

About this course

One of the principal responsibilities of a data scientist is to make reliable predictions based on data. When the amount of data available is enormous, it helps if some of the analysis can be automated. Machine learning is a way of identifying patterns in data and using them to automatically make predictions or decisions. In this data science course, you will learn basic concepts and elements of machine learning.

The two main methods of machine learning you will focus on are regression and classification. Regression is used when you seek to predict a numerical quantity. Classification is used when you try to predict a category (e.g., given information about a financial transaction, predict whether it is fraudulent or legitimate).

For regression, you will learn how to measure the correlation between two variables and compute a best-fit line for making predictions when the underlying relationship is linear. The course will also teach you how to quantify the uncertainty in your prediction using the bootstrap method. These techniques will be motivated by a wide range of examples.
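
In the spirit of those techniques, here is a minimal sketch in Python with NumPy (not the course’s own tooling) of measuring correlation, fitting a best-fit line, and using the bootstrap to quantify uncertainty in the slope, all on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 2, size=200)  # simulated linear relationship

print("correlation:", np.corrcoef(x, y)[0, 1])

slope, intercept = np.polyfit(x, y, deg=1)      # least-squares best-fit line
print("best-fit line: y =", round(slope, 2), "* x +", round(intercept, 2))

# Bootstrap: resample the data many times to see how much the slope varies.
boot_slopes = []
for _ in range(2000):
    idx = rng.integers(0, x.size, size=x.size)  # sample rows with replacement
    s, _ = np.polyfit(x[idx], y[idx], deg=1)
    boot_slopes.append(s)
low, high = np.percentile(boot_slopes, [2.5, 97.5])
print(f"95% bootstrap interval for slope: [{low:.2f}, {high:.2f}]")
```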

For classification, you will learn the k-nearest neighbor classification algorithm, learn how to measure the effectiveness of your classifier, and apply it to real-world tasks including medical diagnoses and predicting genres of movies.
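
And a compact, from-scratch sketch of the k-nearest neighbors idea on synthetic two-class data (illustrative only; the course’s own implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic two-class training data in two dimensions.
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)

def knn_predict(point, k=5):
    """Label a point by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - point, axis=1)
    nearest_labels = y_train[np.argsort(dists)[:k]]
    return np.bincount(nearest_labels).argmax()

print(knn_predict(np.array([0.5, 0.5])))  # expected: class 0
print(knn_predict(np.array([2.8, 3.1])))  # expected: class 1
```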

The course will highlight the assumptions underlying the techniques, and will provide ways to assess whether those assumptions are good. It will also point out pitfalls that lead to overly optimistic or inaccurate predictions.

What you’ll learn

  • Fundamental concepts of machine learning
  • Linear regression, correlation, and the phenomenon of regression to the mean
  • Classification using the k-nearest neighbors algorithm
  • How to compare and evaluate the accuracy of machine learning models
  • Basic probability and Bayes’ theorem

Prerequisites

Foundations of Data Science: Computational Thinking with Python

Foundations of Data Science: Inferential Thinking by Resampling

Meet Your Instructors

Ani Adhikari

Teaching Professor of Statistics at UC Berkeley
Ani Adhikari, Senior Lecturer in Statistics at UC Berkeley, has received the Distinguished Teaching Award at Berkeley and the Dean's Award for Distinguished Teaching at Stanford University. While her research interests are centered on applications of statistics in the natural sciences, her primary focus has always been on teaching and mentoring students. She teaches courses at all levels and has a particular affinity for teaching statistics to students who have little mathematical preparation. She received her undergraduate degree from the Indian Statistical Institute, and her Ph.D. in Statistics from Berkeley.

John DeNero

Giancarlo Teaching Fellow in the EECS Department at UC Berkeley
John DeNero is the Giancarlo Teaching Fellow in the UC Berkeley EECS Department. He joined the Cal faculty in 2014 to focus on undergraduate education in computer science and data science. He teaches and co-develops two of the largest courses on campus: introductory computer science for majors (3000 students per year) and introductory data science (1500 students per year).

David Wagner

Professor of Computer Science at UC Berkeley
David Wagner is Professor of Computer Science at the University of California at Berkeley. He has published over 100 peer-reviewed papers in the scientific literature and has co-authored two books on encryption and computer security. His research has analyzed and contributed to the security of cellular networks, 802.11 wireless networks, electronic voting systems, and other widely deployed systems.

About this course

Analytical models are key to understanding data, generating predictions, and making business decisions. Without models it’s nearly impossible to gain insights from data. In modeling, it’s essential to understand how to choose the right data sets, algorithms, techniques and formats to solve a particular business problem.

In this course, part of the Analytics: Essential Tools and Methods MicroMasters® program, you'll gain an intuitive understanding of fundamental analytics models and methods and practice implementing them using common industry tools like R.

You’ll learn about analytics modeling and how to choose the right approach from among the wide range of options in your toolbox.

You will learn how to use statistical and machine learning models, including models for:

  • classification;
  • clustering;
  • change detection;
  • data smoothing;
  • validation;
  • prediction;
  • optimization;
  • experimentation;
  • decision making.

What you’ll learn

  • Fundamental analytics models and methods
  • How to use analytics software, including R, to implement various types of models
  • Understanding of when to apply specific analytics models

Prerequisites

  • Probability and statistics
  • Basic programming proficiency
  • Linear algebra
  • Basic calculus

Who can take this course?

Unfortunately, learners from one or more of the following countries or regions will not be able to register for this course: Iran, Cuba and the Crimea region of Ukraine. While edX has sought licenses from the U.S. Office of Foreign Assets Control (OFAC) to offer our courses to learners in these countries and regions, the licenses we have received are not broad enough to allow us to offer this course in all locations. EdX truly regrets that U.S. sanctions prevent us from offering all of our courses to everyone, no matter where they live.

Meet Your Instructors

Joel Sokol

Director of the Master of Science in Analytics program
He received his PhD in operations research from MIT and his bachelor’s degrees in mathematics, computer science, and applied sciences in engineering from Rutgers University. His primary research interests are in sports analytics and applied operations research. He has worked with teams or leagues in all three of the major American sports. Dr. Sokol's LRMC method for predictive modeling of the NCAA basketball tournament is an industry leader, and his non-sports research has won the EURO Management Science Strategic Innovation Prize. Dr. Sokol has also won recognition for his teaching and curriculum development from IIE and the NAE, and is the recipient of Georgia Tech's highest awards for teaching.

About this course

Today, businesses, consumers, and societies leave behind massive amounts of data as a by-product of their activities. Leading-edge companies in every industry are using analytics to replace intuition and guesswork in their decision-making. As a result, managers are collecting and analyzing enormous data sets to discover new patterns and insights and running controlled experiments to test hypotheses.

This course prepares students to understand business analytics and become leaders in this area within their organizations. It teaches the scientific process of transforming data into insights for making better business decisions and covers the methodologies, issues, and challenges involved in analyzing business data. Students apply business analytics algorithms and methodologies to real business problems; worked examples place the techniques in context, emphasize the importance of applying them properly, and show how to avoid common pitfalls.

What you’ll learn

After taking this course, students should be able to:

  • approach business problems data-analytically. Students should be able to think carefully and systematically about whether and how data and business analytics can improve business performance.
  • develop business analytics ideas, analyze data using business analytics software, and generate business insights.

Prerequisites

Computing for Data Analysis, Introduction to Analytics Modeling, and each of their prerequisites

Who can take this course?

Unfortunately, learners from one or more of the following countries or regions will not be able to register for this course: Iran, Cuba and the Crimea region of Ukraine. While edX has sought licenses from the U.S. Office of Foreign Assets Control (OFAC) to offer our courses to learners in these countries and regions, the licenses we have received are not broad enough to allow us to offer this course in all locations. EdX truly regrets that U.S. sanctions prevent us from offering all of our courses to everyone, no matter where they live.

Meet Your Instructors

Sridhar Narasimhan

Professor at The Georgia Institute of Technology
Sridhar Narasimhan is Professor of IT Management and Co-Director of the Business Analytics Center (BAC) at the Scheller College of Business. The BAC partners with its Executive Council companies in the analytics space and supports Scheller's BSBA, MBA, and MS Analytics programs. Professor Narasimhan has developed and taught the MBA IT Practicum course. Since 2016, he has been teaching Business Analytics to undergraduate and MBA students at Scheller. Professor Narasimhan is the founder and first Area Coordinator of the nationally ranked Information Technology Management area. In fall 2010, he was the Acting Dean and led the College in its successful AACSB Maintenance of Accreditation effort. He was Senior Associate Dean from 2007 through 2015.