====== GE461: Introduction to Data Science - Spring 2022 ======

Introduction to data science fundamentals, techniques and applications; data collection, preparation, storage and querying; parametric models for data; models and methods for fitting, analysis, evaluation, and validation; dimensionality reduction, visualization; various learning methods, classifiers, clustering, data and text mining; applications in diverse domains such as business, medicine, social networks, computer vision; breadth of knowledge across topics and hands-on experience through projects and computer assignments.

[[https://stars.bilkent.edu.tr/syllabus/view/GE/461/ME_BS/|STARS Syllabus]]

**Prerequisites**: (CS 101 or CS 114 or CS 115) and (MATH 230 or MATH 255 or MATH 260) and (MATH 225 or MATH 241 or MATH 220)\\
**Credits**: 3

**Course Management Systems:** [[https://moodle.bilkent.edu.tr/2021-2022-spring/course/view.php?id=360|Moodle]]\\
**Course Website:** http://www.cs.bilkent.edu.tr/~ge461/2022Spring

**Instructor Team**
  * S. Aksoy, C. Alkan, S. Arashloo, F. Can, E. Çiçek, T. Çukur, S. Dayanık, H. Dibeklioğlu, A. Dündar, İ. Körpeoğlu, C. Tekin, E. Tüzün
  * Course Coordinator (contact point): S. Aksoy (saksoy AT cs.bilkent.edu.tr)

**TAs**
  * Mohsen Moradi (moradi@ee.bilkent.edu.tr)
  * Osama Zafar (osama.zafar@bilkent.edu.tr)

**Classroom and Hours**
  * Classroom: **B-204**
  * Class hours:
    * Tue 10:30-12:20
    * Thu 15:30-17:20

**Grading Policy**
  * Final: 40%
  * Projects: 60%. Multiple computer/programming/exercise assignments of various sizes.
  * There will be 5 projects. **Each project is 12%**.

**Attendance**
  * Attendance is mandatory. A student who misses **more than 9 hours** will fail the course automatically.

**Exam**
  * The final exam will be held at EA-Z01 (for last names in the range ABDUL-GÖÇMEN) and EA-Z03 (for last names in the range GÖZÜBÜYÜK-YÜRÜTEN) during 9:00-12:00 on May 22, 2022.

**Projects**
  * Multiple computer/programming/exercise assignments of various sizes.
  * A project can be assigned earlier than the date indicated on the weekly plan.
  * Projects can be individual or group based; the instructors will decide.
  * Projects will be uploaded to Moodle.
  * Programming languages such as Python, Java, R or Matlab can be used in the projects.
  * Gaining hands-on experience and experimenting will be important. Real-world data sets can be used (economic/financial data sets, medical/biological data sets, image/video data sets, social network data sets, IT data sets, etc.).

**Other**
  * Grades will be posted in SAPS.
  * There is **no mandatory textbook** for the course.

----

==== Week 1 (Feb 1, Feb 3) ====

**Introduction; what is data science; data science applications.** [Çiçek, Tüzün] \\
Topic Details: Introductory concepts in data science and applications. Overview of the data science process.\\
Slides and Additional Material: {{ :ge_461_-_lecture_1_-_course_information_spring_2022.pdf |}}\\
Topic Details: Software engineering applications.\\
Slides and Additional Material: {{ :ge461_lecture_2_datascienceinsoftwareengineering.pdf |}}\\
Project/Exercise-Problem-Set/Homework: None this week.\\
References: \\
Events: Classes begin (Jan 31). \\

==== Week 2 (Feb 8, Feb 10) ====

**Data science applications; data science pipeline.** [Alkan, Dibeklioğlu] \\
Topic Details: Genomics applications.\\
Slides and Additional Material:\\
Topic Details: Computer vision applications.\\
Slides and Additional Material: {{ :ge_461_-_lecture_4_-_computer_vision_applications_-_spring_2022.pdf |}}\\
Project/Exercise-Problem-Set/Homework: None this week.\\
References: [[https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195|"Big Data: Astronomical or Genomical?"]], Stephens et al., 2015\\
Events: \\

==== Week 3 (Feb 15, Feb 17) ====

**Data representation; preprocessing; preparation; crowdsourcing. 
** [Arashloo, Çiçek] \\
Topic Details: Normalization, Noise Removal (Filtering), Anomaly Detection, Data Compression, Noise Removal (ICA).\\
Slides and Additional Material: {{ :preprocessing.pdf |}}\\
Topic Details: Crowdsourcing applications and usage in data science.\\
Slides and Additional Material: {{ :ge_461_-_lecture_6_-_crowdsourcing.pdf |}}\\
Project/Exercise-Problem-Set/Homework: None this week.\\
Events: \\

==== Week 4 (Feb 22, Feb 24) ====

**Data collection; storage; querying; SQL, NoSQL; cloud; distributed storage and computing.** [Körpeoğlu] \\
Topic Details: RDBMS, SQL; SQLite, Pandas; NoSQL; MapReduce and Hadoop; Spark.\\
Slides and Additional Material: {{ :slides.pdf |Slides}} \\
Project/Exercise-Problem-Set/Homework: \\
References: [[https://www.sqlite.org/index.html|SQLite]] [[https://pandas.pydata.org/docs/user_guide/index.html|Pandas]] [[https://spark.apache.org/|ApacheSpark]] [[https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf|Spark]] \\
Events: \\

==== Week 5 (Mar 1, Mar 3) ====

**Basic models; parametric models; fitting.** [S. Dayanık] \\
Topic Details: Exploratory data analysis, loess smoother, chi-squared test of independence, linear regression and least squares method, factors and dummy variables, all illustrated on the //Dodgers Advertising and Promotion// case study with R, RStudio, and SQLite.\\
Slides and Additional Material:
  * {{ :dodgers-week5.pdf | Dodgers promotion analysis started}}
  * {{ :week05_spring_2022.zip | Week05 course materials}}
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: \\

==== Week 6 (Mar 8, Mar 10) ====

**Spring Break** \\
Topic Details: \\
Slides: \\
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: Spring Break (March 10-13)\\

==== Week 7 (Mar 15, Mar 17) ====

**Application to customer choice problems (conjoint analysis)** [S. 
Dayanık] \\
Topic Details: Part worths, part importance, and their estimation from product rankings with multiple regression.\\
Slides and Additional Material:
  * {{ :dodgers-w7.pdf | Dodgers promotion analysis completed}}
  * {{ :conjoint_notes.pdf | Introduction to conjoint analysis}}
  * {{ :week07_spring_2022.zip | Week07 course materials}}
Project/Exercise-Problem-Set/Homework: **Dodgers Promotion Project** due **19:00 on Saturday, April 9**, to be submitted on the Moodle page. Project details are in the dodgers.Rmd/dodgers.html files inside the Week 7 course materials.\\
References: \\
Events: \\

==== Week 8 (Mar 22, Mar 24) ====

**Conjoint analysis continued, and authorship problem** [S. Dayanık] \\
Topic Details: New product design with market simulation to increase overall market share; who wrote the Federalist Papers (identification of authorship by means of Bayesian classifiers, kNN). \\
Slides and Additional Material:
  * {{ :conjoint.pdf | Conjoint analysis in R}}
  * {{ :federalist.pdf | Federalist papers analysis in R}}
  * {{ :week08_spring_2022.zip | Week08 course materials}}
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: \\

==== Week 9 (Mar 29, Mar 31) ====

**Dimensionality reduction; visualization.** [Aksoy] \\
Topic Details: Feature reduction, feature selection, high-dimensional data visualization.\\
Slides and Additional Material: {{ ::ge461_dimensionality.pdf |Dimensionality slides}}, {{ :knaw_t-sne_talk.pptx |t-SNE slides}} \\
Project/Exercise-Problem-Set/Homework: [{{ :ge461_project_dimensionality.pdf |Project}} ({{ :digits.zip |data}})] (due 23:59 on April 21, 2022)\\
References: [[https://www.mathworks.com/help/stats/dimensionality-reduction.html|Matlab: dimensionality reduction]], [[https://scikit-learn.org/stable/modules/decomposition.html|Scikit-learn: decomposition]], [[https://scikit-learn.org/stable/auto_examples/index.html#decomposition-examples|Scikit-learn: decomposition examples]], 
[[https://scikit-learn.org/stable/modules/manifold.html|Scikit-learn: manifold learning]], [[https://lvdmaaten.github.io/tsne/|t-SNE]]\\
Events: \\

==== Week 10 (Apr 5, Apr 7) ====

**Unsupervised learning, clustering.** [Aksoy] \\
Topic Details: K-means clustering, mixture models, hierarchical clustering.\\
Slides and Additional Material: {{ ::ge461_clustering.pdf |Clustering slides}}\\
Project/Exercise-Problem-Set/Homework: \\
References: [[https://www.mathworks.com/help/stats/cluster-analysis.html|Matlab: cluster analysis]], [[https://scikit-learn.org/stable/modules/clustering.html|Scikit-learn: clustering]]\\
Events: Bilkent Day (April 3)\\

==== Week 11 (Apr 12, Apr 14) ====

**Machine learning; supervised learning; classifiers; deep learning.** [Dündar]\\
Topic Details: Bayesian decision theory, linear discriminants, introduction to neural networks, support vector machines, decision trees.\\
Slides and Additional Material: {{ ::ge461_supervisedlearning_part1.pdf |Part1}}, {{ ::ge461_supervisedlearning_part2.pdf |Part2}} \\
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: \\

==== Week 12 (Apr 19, Apr 21) ====

**Machine learning; supervised learning; classifiers; deep learning.** [Dibeklioğlu] \\
Topic Details: Activation functions, convolutional neural networks, recurrent architectures.\\
Slides and Additional Material: {{ ::ge461_deep_learning_2022s.pdf}} \\
Project/Exercise-Problem-Set/Homework: [{{ ::GE461_project_supervised_learning.pdf |Project Description}} | {{ :data_supervised_learning_project.zip |Data}}] (due 23:55 on April 30, 2022)\\
References: \\
Events: \\

==== Week 13 (Apr 26, Apr 28) ====

**Machine learning in healthcare.** [Çukur] \\
Topic Details: Healthcare analytics: diagnostics, medical imaging, in-patient care, hospital management, risk analytics, wearables. 
Deep learning architectures for medical applications.\\
Slides and Additional Material: {{ :ge461_ml_in_healthcare.pdf |}}\\
Project/Exercise-Problem-Set/Homework: (due May 6, 2022) {{ ::ge461_pw13_description.pdf |}} {{ :ge461_pw13_data.zip |}} \\
References: Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, Ch. 11 and 14; Mead, Analog VLSI and Neural Systems, Ch. 4; Bishop, Pattern Recognition and Machine Learning, Ch. 5\\
Events: Spring Festival (Apr 29-30)\\

==== Week 14 (May 3, May 5) ====

**Data mining; online data stream classification; applications.** [Can] \\
Topic Details: Concept drift, ensemble-based classification, text mining. \\
Slides and Additional Material: {{ ::ge461_datastreamminingspring22_ver2.pdf |DataStreamMining}}\\
Project/Exercise-Problem-Set/Homework:\\
References: \\
Events: Feast of Ramadan holiday (May 2-4)\\

==== Week 15 (May 10, May 12) ====

**Reinforcement learning; applications.** [Tekin] \\
Topic Details: Applications of Reinforcement Learning, Markov Decision Processes, Value Iteration, Q-Learning.\\
Slides and Additional Material: {{ :ge461_reinforcementlearning.pdf |}} \\
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: Last day of classes (May 13)\\

----

==== Textbooks ====

  * [[https://www.textbook.ds100.org/intro|Principles and Techniques of Data Science - Online]]
  * [[http://shop.oreilly.com/product/0636920023784.do|Python for Data Analysis, by Wes McKinney]]
  * [[http://shop.oreilly.com/product/0636920028529.do|Doing Data Science, by Cathy O’Neil and Rachel Schutt. O’Reilly. 
2014.]]
  * [[https://www.oreilly.com/library/view/data-science-from/9781492041122/|Data Science from Scratch, second edition, O'Reilly, 2019.]]
  * [[http://shop.oreilly.com/product/0636920034919.do|Python Data Science Handbook, O'Reilly, 2016.]]
  * [[https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=pd_sim_14_2?_encoding=UTF8&pd_rd_i=1491962291&pd_rd_r=a661bb45-d0b9-11e8-9fea-e722222b4194&pd_rd_w=hDZAL&pd_rd_wg=TW8F8&pf_rd_i=desktop-dp-sims&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=18bb0b78-4200-49b9-ac91-f141d61a1780&pf_rd_r=5H464VA0VJ0JFK1QFQXJ&pf_rd_s=desktop-dp-sims&pf_rd_t=40701&psc=1&refRID=5H464VA0VJ0JFK1QFQXJ|Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O'Reilly, 2017.]]
  * [[https://www.cs.ubc.ca/~murphyk/MLbook/|Machine Learning: a Probabilistic Perspective]]
  * [[https://www-bcf.usc.edu/~gareth/ISL/|An Introduction to Statistical Learning, R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.]]
  * [[https://www.springer.com/gp/book/9780387310732|Pattern Recognition and Machine Learning, Christopher Bishop]]
  * [[https://www.amazon.com/Neural-Networks-Learning-Machines-3rd/dp/0131471392|Neural Networks and Learning Machines]]
  * [[https://rafalab.github.io/dsbook/|Introduction to Data Science - Data Analysis and Prediction Algorithms with R, Rafael A. Irizarry. 
Online Book.]]
  * [[http://www.mmds.org/|Mining Massive Datasets, third edition, Ullman et al., 2020.]]

==== Similar / Complementary Courses ====

  * [[https://bcourses.berkeley.edu/courses/1267848|CS194, Introduction to Data Science, Berkeley]]
  * [[http://data8.org/|Data 8: Introduction to Data Science, Berkeley]]
  * [[http://www.ds100.org/|Data 100: Principles and Techniques of Data Science, Berkeley]]
  * [[https://www.cs.purdue.edu/homes/neville/courses/CS24200.html|CS24200, Introduction to Data Science, Purdue]]
  * [[https://web.stanford.edu/class/stats101/|Data Science 101, Stanford]]
  * [[http://cs109.github.io/2015/index.html|CS109, Data Science, Harvard]]
  * [[https://www.cs.umd.edu/class/spring2017/cmsc320/|Introduction to Data Science I, Maryland]]
  * [[http://users.umiacs.umd.edu/~hcorrada/IntroDataSci/syllabus.html|Introduction to Data Science II, Maryland]]
  * [[https://www.eecs.wsu.edu/~assefaw/CptS483-06/|Introduction to Data Science, WSU]]
  * [[https://www.conted.ox.ac.uk/courses/applied-data-science|An Overview of Data Science, Oxford]]
  * [[https://www.cambridgenetwork.co.uk/events/applied-data-science-course-become-a-data-scientist-in-6-months/|Applied Data Science, Cambridge]]
  * [[https://studiegids.tudelft.nl/a101_displayCourse.do?restoreContext=true&SIS_SwitchLang=en&course_id=41759|Data Analysis, Delft]]
  * [[https://www.studocu.com/en/course/technische-universiteit-delft/programming-and-data-science-for-the-99/77128|Programming and Data Science for 99 Percent, Delft]]
  * [[https://datasciencedegree.wisconsin.edu/data-science-700-foundations-of-data-science/|Foundations of Data Science, Wisconsin]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/CS/464/|Introduction to Machine Learning, Bilkent]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/EEE/443/EE_BS/|Neural Networks, Bilkent]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/EEE/485/EE_BS/|Statistical Learning and Data Analytics, Bilkent]]
  * 
[[https://stars.bilkent.edu.tr/syllabus/view/IE/451/IE_BS/|Applied Data Analysis, Bilkent]]
  * [[http://www.cs.bilkent.edu.tr/~gunduz/teaching/cs550/|Machine Learning, Bilkent]]
  * [[http://www.cs.bilkent.edu.tr/~saksoy/courses/cs551/|Pattern Recognition, Bilkent]]
  * [[http://cs.brown.edu/courses/csci1951-a/|CS1951A Data Science, Brown]]
  * [[http://www.datasciencecourse.org/|CMU15-388/688 Practical Data Science, CMU]]
  * [[https://www.hse.ru/data/2016/10/06/1087301973/program-869867030-nOE2xyWAyH.pdf|Introduction to Data Science, Russia]]
  * [[https://ci.uky.edu/sis/sites/default/files/syllabi/Syllabus-LIS690-Introduction%20to%20Data%20Science-20160101_1.pdf|LIS690, Introduction to Data Science, Kentucky]]

==== Tools, Libraries, Systems, Languages ====

  * [[https://pandas.pydata.org/|Pandas: Python Data Analysis Library]]
  * [[https://aws.amazon.com/machine-learning/?sc_channel=PS&sc_campaign=acquisition_TR&sc_publisher=google&sc_medium=ACQ-P%7CPS-GO%7CNon-Brand%7CDesktop%7CSU%7CMachine%20Learning%7CMachine%20Learning%7CTR%7CEN%7CText&sc_content=ml_general_bmm&sc_detail=%2Bmachine%20%2Blearning&sc_category=Machine%20Learning&sc_segment=293640020615&sc_matchtype=b&sc_country=TR&s_kwcid=AL!4422!3!293640020615!b!!g!!%2Bmachine%20%2Blearning&ef_id=W4g8FAAAAM1g4jhU:20181015203314:s|Machine Learning on AWS]]
  * [[https://spark.apache.org/|Apache Spark]]