10605
Large datasets pose difficulties across the machine learning pipeline. They are difficult to visualize, and it can be hard to determine what sorts of errors and biases may be present in them. They are computationally expensive to process, and the cost of learning is often hard to predict; for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.
This course is intended to provide students with practical knowledge of, and experience with, the issues involving large datasets. Among the topics considered are: data cleaning, visualization, and pre-processing at scale; principles of parallel and distributed computing for machine learning; techniques for scalable deep learning; analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity; and methods for low-latency inference. We gained experience with common large-scale computing libraries and infrastructure, including Databricks, Apache Spark, and TensorFlow.
Final Project
We built a recommender system on the Amazon ratings dataset, which contains 233 million reviews. We trained the model using collaborative filtering and content-based filtering. We used Spark and Databricks for experiments on small subsets of the data, and ran the full training on an Amazon EMR cluster with 7 to 10 m5.xlarge nodes. The ratings in the dataset are skewed, so we judged the model against the goal of producing meaningful recommendations; by that measure it was successful, as reflected in its RMSE and MAE. The project also showed us that large-scale data can cause a variety of unexpected problems in an ML pipeline.
Link here
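Because the ratings are skewed, RMSE and MAE are easier to read next to a trivial baseline that always predicts the mean rating. The sketch below is only an illustration of that comparison in plain Python: the ten ratings are made up for the example and are not taken from the Amazon dataset, and the helper names `rmse` and `mae` are our own.

```python
import math

def rmse(y_true, y_pred):
    # Root mean squared error: penalizes large misses more heavily.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    # Mean absolute error: average size of the miss, in rating stars.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical held-out ratings, skewed toward 5 stars (illustrative only).
ratings = [5, 5, 5, 4, 5, 1, 5, 4, 5, 5]

# Baseline: predict the mean rating for every (user, item) pair.
mean_rating = sum(ratings) / len(ratings)
baseline = [mean_rating] * len(ratings)

baseline_rmse = rmse(ratings, baseline)
baseline_mae = mae(ratings, baseline)
print(f"baseline RMSE={baseline_rmse:.2f}, MAE={baseline_mae:.2f}")
```

On skewed data this baseline can already look good, so a recommender is arguably only meaningful if its RMSE and MAE on the same held-out ratings are lower than the baseline's.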
Major Assignments:
1. Building a word count application using Spark
2. Entity resolution as text similarity
3. Linear regression model to predict the release year of a song from a set of audio features, using MLlib from PySpark
4. Click-through rate prediction pipeline
5. Neural style transfer in TensorFlow
6. Autodiff and MLP to classify the DB media dataset
7. Model compression and optimization methods
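As a rough illustration of the first assignment, the sketch below walks through the three stages a Spark word count usually goes through (split lines into words, emit a count of 1 per word, sum the counts per word). It is written in plain Python rather than Spark so that it runs anywhere; the function name `word_count`, the tokenizing regex, and the sample lines are assumptions made for this example, not the assignment's actual code.

```python
import re
from collections import defaultdict

def word_count(lines):
    # Stage 1 (like flatMap): split every line into lowercase words.
    words = (w for line in lines for w in re.findall(r"[a-z0-9']+", line.lower()))
    # Stage 2 (like map): emit a (word, 1) pair for each word.
    pairs = ((w, 1) for w in words)
    # Stage 3 (like reduceByKey): sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Illustrative input only.
counts = word_count(["the quick brown fox", "The lazy dog", "the end"])
print(sorted(counts.items(), key=lambda kv: -kv[1])[:3])
```

In Spark the same stages run in parallel across partitions, and the per-word sum is what triggers the shuffle between nodes.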