====== GE461: Introduction to Data Science - Spring 2023 ======

Introduction to data science fundamentals, techniques and applications; data collection, preparation, storage and querying; parametric models for data; models and methods for fitting, analysis, evaluation, and validation; dimensionality reduction, visualization; various learning methods, classifiers, clustering, data and text mining; applications in diverse domains such as business, medicine, social networks, computer vision; breadth of knowledge on topics and hands-on experience through projects and computer assignments. [[https://stars.bilkent.edu.tr/syllabus/view/GE/461/|STARS Syllabus]]

**Prerequisites**: (CS 101 or CS 114 or CS 115) and (MATH 230 or MATH 255 or MATH 260) and (MATH 225 or MATH 241 or MATH 220)\\
**Credits**: 3

**Course Management Systems:** [[https://moodle.bilkent.edu.tr/2022-2023-spring/course/view.php?id=361|Moodle]]\\
**Course Website:** http://www.cs.bilkent.edu.tr/~ge461/2023Spring

**Instructor Team**

  * S. Aksoy, C. Alkan, S. Arashloo, F. Can, T. Çukur, S. Dayanık, H. Dibeklioğlu, A. Dündar, İ. Körpeoğlu, C. Tekin, E. Tüzün
  * Course Coordinator (contact point): S. Aksoy (saksoy AT cs.bilkent.edu.tr)

**TAs**

  * Hakan Gökçesu (hgokcesu AT ee.bilkent.edu.tr)
  * Sayyed Ahmad Naghavi Nozad (ahmad.naghavi AT bilkent.edu.tr)

**Classroom and Hours**

  * Classroom: **B-Z06**
  * Class hours:
    * Mon 08:30-10:20
    * Wed 13:30-15:20

**Grading Policy**

  * Final: 40%
  * Projects: 60%. Multiple computer/programming/exercise assignments of various sizes.
  * There will be 5 projects. **Each project is 12%**.

**Attendance**

  * Attendance is mandatory. A student who misses **more than 9 hours** will fail the course automatically.

**Exam**

  * The final exam will be held at EB-103 (for last names in the range AKSOY-GÜZEY) and EB-104 (for last names in the range HAMURCU-YILDIZ) during 18:00-21:00 on June 10, 2023.

**Projects**

  * Multiple computer/programming/exercise assignments of various sizes.
  * A project can be assigned earlier than the indicated date on the weekly plan.
  * Projects may be individual or group-based, as decided by the instructors.
  * Projects will be uploaded to Moodle.
  * Programming languages like Python, Java, R or Matlab can be used in the projects.
  * Gaining hands-on experience and experimenting will be important. Real-world data sets can be used (economic/financial data sets, medical/biological data sets, image/video data sets, social network data sets, IT data sets, etc.).

**Other**

  * Grades will be posted in SAPS.
  * There is **no mandatory textbook** for the course.

----

==== Week 1 (Jan 30, Feb 1) ====

**Introduction; what is data science; data science applications.** [Aksoy, Tüzün] \\
Topic Details: Introductory concepts in data science and applications. Overview of the data science process.\\
Slides and Additional Material: {{ ::ge461_lecture1_course_information.pdf |}}\\
Topic Details: Software engineering applications.\\
Slides and Additional Material:\\
Project/Exercise-Problem-Set/Homework: None this week.\\
References: \\
Events: Classes begin (Jan 30).
\\

==== Week 2 (Feb 6, Feb 8) ====

**Data science applications; data science pipeline.** [Alkan, Dibeklioğlu] \\
Topic Details: Genomics applications.\\
Slides and Additional Material: {{ :ge461_lectures_3_genomics_applications-spring2023.pdf |}}\\
Topic Details: Computer vision applications.\\
Slides and Additional Material: {{ ::ge461_applications_vision_2023s.pdf |}} \\
Project/Exercise-Problem-Set/Homework: None this week.\\
References: [[https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195|"Big Data: Astronomical or Genomical?"]], Stephens et al., 2015\\
Events: \\

==== Week 3 (Feb 20, Feb 22) ====

**Crowdsourcing; data representation; preprocessing; preparation.** [Arashloo] \\
Topic Details: Crowdsourcing applications and usage in data science.\\
Topic Details: Normalization, Noise Removal (Filtering), Anomaly Detection, Data Compression, Noise Removal (ICA).\\
Slides and Additional Material: {{ ::Crowdsourcing.pdf |}} \\
Slides and Additional Material: {{ ::Preprocessing.pdf |}} \\
Project/Exercise-Problem-Set/Homework: None this week.\\
Events: \\

==== Week 4 (Feb 27, Mar 1) ====

**Data collection; storage; querying; SQL, NoSQL; cloud; distributed storage and computing.** [Körpeoğlu] \\
Topic Details: RDBMSs, SQL; SQLite, Pandas; NoSQL; MapReduce and Hadoop; Spark.\\
Slides and Additional Material: {{ :data-storage-and-processing.pdf |}} \\
Project/Exercise-Problem-Set/Homework: \\
References: [[https://www.sqlite.org/index.html|SQLite]], [[https://pandas.pydata.org/docs/user_guide/index.html|Pandas]], [[https://spark.apache.org/|Apache Spark]], [[https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf|Spark]] \\
Events: \\

==== Week 5 (Mar 6, Mar 8) ====

**Spring Break** \\
Topic Details: \\
Slides: \\
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: Spring Break (Mar 6-8)\\

==== Week 6 (Mar 13, Mar 15) ====

**Basic models; parametric models; fitting.** [S.
Dayanık] \\
Topic Details: Exploratory data analysis, loess smoother, chi-squared test of independence\\
Slides and Additional Material: {{ ::s2023_week06.zip |}} \\
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: \\

==== Week 7 (Mar 20, Mar 22) ====

**Linear regression, goodness of fit** [S. Dayanık] \\
Topic Details: Linear regression and least squares method, factors and dummy variables, analysis of variance\\
Slides and Additional Material: {{ ::s2023_week07.zip |}}\\
Project/Exercise-Problem-Set/Homework:\\
References: \\
Events: \\

==== Week 8 (Mar 27, Mar 29) ====

**Diagnostic plots, nested and non-nested model comparisons** [S. Dayanık] \\
Topic Details: Hypothesis testing, confidence intervals, prediction intervals \\
Slides and Additional Material: {{ ::s2023_week08.zip |}}\\
Project: **Complete Analysis of Dodgers Advertising and Promotion Study** due 19:00 on Sunday, April 23. Details are in dodgers.html in the zip file.\\
References: \\
Events: \\

==== Week 9 (Apr 3, Apr 5) ====

**Dimensionality reduction; visualization.** [Aksoy] \\
Topic Details: Feature reduction, feature selection, high-dimensional data visualization.\\
Slides and Additional Material: {{ :ge461_dimensionality.pdf |Dimensionality slides}}, {{ :knaw_t-sne_talk.pptx |t-SNE slides}}\\
Project/Exercise-Problem-Set/Homework: [{{ :ge461_project_dimensionality.pdf |Project}} ({{ :digits.zip |data}})] (due 23:59 on May 7, 2023)\\
References: [[https://www.mathworks.com/help/stats/dimensionality-reduction.html|Matlab: dimensionality reduction]], [[https://scikit-learn.org/stable/modules/decomposition.html|Scikit-learn: decomposition]], [[https://scikit-learn.org/stable/auto_examples/index.html#decomposition|Scikit-learn: decomposition examples]], [[https://scikit-learn.org/stable/modules/manifold.html|Scikit-learn: manifold learning]], [[https://www.mathworks.com/discovery/data-visualization.html|Matlab: data visualization]], [[https://matplotlib.org/|Matplotlib: data visualization]], [[https://lvdmaaten.github.io/tsne/|t-SNE]]\\
Events: Bilkent Day (April 3)\\

==== Week 10 (Apr 10, Apr 12) ====

**Midterm Week**

==== Week 11 (Apr 17, Apr 19) ====

**Unsupervised learning, clustering.** [Aksoy] \\
Topic Details: K-means clustering, mixture models, hierarchical clustering.\\
Slides and Additional Material: {{ :ge461_clustering.pdf |Clustering slides}}\\
Project/Exercise-Problem-Set/Homework: \\
References: [[https://www.mathworks.com/help/stats/cluster-analysis.html|Matlab: cluster analysis]], [[https://scikit-learn.org/stable/modules/clustering.html|Scikit-learn: clustering]], [[https://scikit-learn.org/stable/auto_examples/index.html#clustering|Scikit-learn: clustering examples]]\\
Events: Feast of Ramadan holiday (Apr 21-23), National Sovereignty and Children's Day holiday (Apr 23)\\

==== Week 12 (Apr 24, Apr 26) ====

**Machine learning; supervised learning; classifiers; deep learning.** [Dündar]\\
Topic Details: Bayesian decision theory, linear discriminants, introduction to neural networks, support vector machines, decision trees.\\
Slides and Additional Material: {{ :ge461_supervisedlearning_part1.pdf |}}\\
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: \\

==== Week 13 (May 1, May 3) ====

**Machine learning; supervised learning; classifiers; deep learning.
** [Dündar]\\
Topic Details: Bayesian decision theory, linear discriminants, introduction to neural networks, support vector machines, decision trees.\\
Slides and Additional Material: {{ :ge461_supervisedlearning_part2.pdf |}} \\
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: Labor and Solidarity Day holiday (May 1)\\

==== Week 14 (May 8, May 10) ====

**Machine learning; supervised learning; classifiers; deep learning.** [Dibeklioğlu] \\
Topic Details: Activation functions, convolutional neural networks, recurrent architectures.\\
Slides and Additional Material: {{ ::ge461_deep_learning_2023s.pdf |}} \\
Project/Exercise-Problem-Set/Homework: [{{ :GE461_project_supervised_learning_2023s.pdf |Project Description}} | {{ :data_supervised_learning_project.zip |Data}}] (due 23:55 on May 22, 2023)\\
References: \\
Events: \\

==== Week 15 (May 15, May 17) ====

**Machine learning in healthcare.** [Çukur] \\
Topic Details: Healthcare analytics: diagnostics, medical imaging, in-patient care, hospital management, risk analytics, wearables. Deep learning architectures for medical applications. \\
Slides and Additional Material: {{ :ge461_ml_in_healthcare.pdf |}}\\
Project: {{ :ge461_pw13_description.pdf |}} {{ :ge461_pw13_data.zip |}}\\
References: Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, Ch. 11 and 14; Mead, Analog VLSI and Neural Systems, Ch. 4; Bishop, Pattern Recognition and Machine Learning, Ch. 5\\
Events: \\

==== Week 16 (May 22, May 24) ====

**Data mining; online data stream classification; applications.** [Can] \\
Topic Details: Concept drift, ensemble-based classification, text mining. \\
Slides and Additional Material: {{ :GE461_dataStreamMiningSpring23.pdf |}} {{ :GE461_dataStreamHWspringVer2_2023.pdf |}}\\
Project/Exercise-Problem-Set/Homework: {{ ge461_datastreamhwspringver1_2023_2.pdf |}}\\
References: \\
Events: \\

==== Week 17 (May 29, May 31) ====

**Reinforcement learning; applications.
** [Tekin] \\
Topic Details: Applications of reinforcement learning, Markov decision processes, value iteration, Q-learning, multi-armed bandits \\
Slides and Additional Material: https://www.dropbox.com/s/65h9melvnvuml2x/ge461_reinforcementlearning.pdf?dl=0 \\
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: \\

----

==== Textbooks ====

  * [[https://www.textbook.ds100.org/intro|Principles and Techniques of Data Science - Online]]
  * [[http://shop.oreilly.com/product/0636920023784.do|Python for Data Analysis, by Wes McKinney]]
  * [[http://shop.oreilly.com/product/0636920028529.do|Doing Data Science, by Cathy O’Neil and Rachel Schutt. O’Reilly. 2014.]]
  * [[https://www.oreilly.com/library/view/data-science-from/9781492041122/|Data Science from Scratch, second edition, O'Reilly, 2019.]]
  * [[http://shop.oreilly.com/product/0636920034919.do|Python Data Science Handbook, O'Reilly, 2016.]]
  * [[https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=pd_sim_14_2?_encoding=UTF8&pd_rd_i=1491962291&pd_rd_r=a661bb45-d0b9-11e8-9fea-e722222b4194&pd_rd_w=hDZAL&pd_rd_wg=TW8F8&pf_rd_i=desktop-dp-sims&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=18bb0b78-4200-49b9-ac91-f141d61a1780&pf_rd_r=5H464VA0VJ0JFK1QFQXJ&pf_rd_s=desktop-dp-sims&pf_rd_t=40701&psc=1&refRID=5H464VA0VJ0JFK1QFQXJ|Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O'Reilly, 2017.]]
  * [[https://www.cs.ubc.ca/~murphyk/MLbook/|Machine Learning: a Probabilistic Perspective]]
  * [[https://www-bcf.usc.edu/~gareth/ISL/|An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.]]
  * [[https://www.springer.com/gp/book/9780387310732|Pattern Recognition and Machine Learning, Christopher Bishop]]
  * [[https://www.amazon.com/Neural-Networks-Learning-Machines-3rd/dp/0131471392|Neural Networks and Learning Machines]]
  * [[https://rafalab.github.io/dsbook/|Introduction to Data Science - Data Analysis and Prediction Algorithms with R, Rafael A. Irizarry. Online Book.]]
  * [[http://www.mmds.org/|Mining Massive Datasets, third edition, Ullman et al., 2020.]]

==== Similar / Complementary Courses ====

  * [[https://bcourses.berkeley.edu/courses/1267848|CS194, Introduction to Data Science, Berkeley]]
  * [[http://data8.org/|Data 8: Introduction to Data Science, Berkeley]]
  * [[http://www.ds100.org/|Data 100: Principles and Techniques of Data Science, Berkeley]]
  * [[https://www.cs.purdue.edu/homes/neville/courses/CS24200.html|CS24200, Introduction to Data Science, Purdue]]
  * [[https://web.stanford.edu/class/stats101/|Data Science 101, Stanford]]
  * [[http://cs109.github.io/2015/index.html|CS109, Data Science, Harvard]]
  * [[https://www.cs.umd.edu/class/spring2017/cmsc320/|Introduction to Data Science I, Maryland]]
  * [[http://users.umiacs.umd.edu/~hcorrada/IntroDataSci/syllabus.html|Introduction to Data Science II, Maryland]]
  * [[https://www.eecs.wsu.edu/~assefaw/CptS483-06/|Introduction to Data Science, WSU]]
  * [[https://www.conted.ox.ac.uk/courses/applied-data-science|An Overview of Data Science, Oxford]]
  * [[https://www.cambridgenetwork.co.uk/events/applied-data-science-course-become-a-data-scientist-in-6-months/|Applied Data Science, Cambridge]]
  * [[https://studiegids.tudelft.nl/a101_displayCourse.do?restoreContext=true&SIS_SwitchLang=en&course_id=41759|Data Analysis, Delft]]
  * [[https://www.studocu.com/en/course/technische-universiteit-delft/programming-and-data-science-for-the-99/77128|Programming and Data Science for the 99 Percent, Delft]]
  * [[https://datasciencedegree.wisconsin.edu/data-science-700-foundations-of-data-science/|Foundations of Data Science, Wisconsin]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/CS/464/|Introduction to Machine Learning, Bilkent]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/EEE/443/EE_BS/|Neural Networks, Bilkent]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/EEE/485/EE_BS/|Statistical Learning and Data Analytics, Bilkent]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/IE/451/IE_BS/|Applied Data Analysis, Bilkent]]
  * [[http://www.cs.bilkent.edu.tr/~gunduz/teaching/cs550/|Machine Learning, Bilkent]]
  * [[http://www.cs.bilkent.edu.tr/~saksoy/courses/cs551/|Pattern Recognition, Bilkent]]
  * [[http://cs.brown.edu/courses/csci1951-a/|CS1951A Data Science, Brown]]
  * [[http://www.datasciencecourse.org/|CMU15-388/688 Practical Data Science, CMU]]
  * [[https://www.hse.ru/data/2016/10/06/1087301973/program-869867030-nOE2xyWAyH.pdf|Introduction to Data Science, Russia]]
  * [[https://ci.uky.edu/sis/sites/default/files/syllabi/Syllabus-LIS690-Introduction%20to%20Data%20Science-20160101_1.pdf|LIS690, Introduction to Data Science, Kentucky]]

==== Tools, Libraries, Systems, Languages ====

  * [[https://pandas.pydata.org/|Pandas: Python Data Analysis Library]]
  * [[https://aws.amazon.com/machine-learning/?sc_channel=PS&sc_campaign=acquisition_TR&sc_publisher=google&sc_medium=ACQ-P%7CPS-GO%7CNon-Brand%7CDesktop%7CSU%7CMachine%20Learning%7CMachine%20Learning%7CTR%7CEN%7CText&sc_content=ml_general_bmm&sc_detail=%2Bmachine%20%2Blearning&sc_category=Machine%20Learning&sc_segment=293640020615&sc_matchtype=b&sc_country=TR&s_kwcid=AL!4422!3!293640020615!b!!g!!%2Bmachine%20%2Blearning&ef_id=W4g8FAAAAM1g4jhU:20181015203314:s|Machine Learning on AWS]]
  * [[https://spark.apache.org/|Apache Spark]]