====== GE461: Introduction to Data Science - Spring 2024 ======

Introduction to data science fundamentals, techniques and applications; data collection, preparation, storage and querying; parametric models for data; models and methods for fitting, analysis, evaluation, and validation; dimensionality reduction, visualization; various learning methods, classifiers, clustering, data and text mining; applications in diverse domains such as business, medicine, social networks, computer vision; breadth knowledge on topics and hands-on experience through projects and computer assignments. [[https://stars.bilkent.edu.tr/syllabus/view/GE/461/|STARS Syllabus]]

**Prerequisites**: (CS 101 or CS 114 or CS 115) and (MATH 230 or MATH 255 or MATH 260) and (MATH 225 or MATH 241 or MATH 220)\\
**Credits**: 3

**Course Management Systems:** [[https://moodle.bilkent.edu.tr/2023-2024-spring/course/view.php?id=455|Moodle]]\\
**Course Website:** http://www.cs.bilkent.edu.tr/~ge461/2024Spring

**Instructor Team**
  * S. Aksoy, C. Alkan, S. Arashloo, F. Can, E. Çiçek, T. Çukur, S. Dayanık, H. Dibeklioğlu, A. Dündar, İ. Körpeoğlu, C. Tekin, E. Tüzün
  * Course Coordinator (contact point): S. Aksoy (saksoy AT cs.bilkent.edu.tr)

**TAs**
  * Ali Azak (ali.azak AT bilkent.edu.tr)
  * Hakan Gökçesu (hgokcesu AT ee.bilkent.edu.tr)

**Classroom and Hours**
  * Classroom: **EE-317**
  * Class hours:
    * Mon 13:30-15:20
    * Thu 08:30-10:20

**Grading Policy**
  * Final: 40 %
  * Projects: 60 %. Multiple computer/programming/exercise assignments of various sizes.
  * There will be 5 projects. **Each project is 12 %**.

**Attendance**
  * Attendance is mandatory. A student who misses **more than 9 hours** will fail the course automatically.

**Exam**
  * TBD

**Projects**
  * Multiple computer/programming/exercise assignments of various sizes.
  * A project can be assigned earlier than the date indicated on the weekly plan.
  * Projects can be individual or group based. Instructors will decide.
  * Projects will be uploaded to Moodle.
  * Programming languages like Python, Java, R or Matlab can be used in the projects.
  * Gaining hands-on experience and experimenting will be important. Real-world data sets can be used (economic/financial data sets, medical/biological data sets, image/video data sets, social network data sets, IT data sets, etc.).

**Other**
  * Grades will be posted in SAPS.
  * There is **no mandatory textbook** for the course.

----

==== Week 1 (Jan 29, Feb 1) ====

**Introduction; what is data science; data science applications.** [Çiçek, Tüzün] \\
Topic Details: Introductory concepts in data science and applications. Overview of data science process.\\
Slides and Additional Material:\\
Topic Details: Software engineering applications.\\
Slides and Additional Material:{{ :ge_461_-_lecture_1_-_course_information_compressed.pdf |}}\\
Project/Exercise-Problem-Set/Homework: None this week.\\
References: \\
Events: \\

==== Week 2 (Feb 5, Feb 8) ====

**Data science applications; data science pipeline.** [Alkan, Dibeklioğlu] \\
Topic Details: Genomics applications.\\
Slides and Additional Material:\\
Topic Details: Computer vision applications.\\
Slides and Additional Material: {{ :ge461_applications_vision_2024s.pdf |}}\\
Project/Exercise-Problem-Set/Homework: None this week.\\
References: [[https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195|"Big Data: Astronomical or Genomical?"]], Stephens et al., 2015\\
Events: \\

==== Week 3 (Feb 12, Feb 15) ====

**Data representation; preprocessing; preparation; crowdsourcing.** [Arashloo, Çiçek] \\
Topic Details: Normalization, Noise Removal (Filtering), Anomaly Detection, Data Compression, Noise Removal (ICA).\\
Slides and Additional Material:{{ :2024_Pre-processing.pdf |}}\\
Topic Details: Crowdsourcing applications and usage in data science.\\
Slides and Additional Material:{{ :ge461-crowdsourcing.pdf |}}\\
Project/Exercise-Problem-Set/Homework: None this week.\\
Events: \\

==== Week 4 (Feb 19, Feb 22) ====

**Data collection; storage; querying; SQL, NoSQL; cloud; distributed storage and computing.** [Körpeoğlu] \\
Topic Details: RDBMSs, SQL; SQLite, Pandas; NoSQL; MapReduce and Hadoop; Spark.\\
Slides and Additional Material: {{ :slides.pdf |Slides}}\\
Project/Exercise-Problem-Set/Homework: \\
References: [[https://www.sqlite.org/index.html|SQLite]] [[https://pandas.pydata.org/docs/user_guide/index.html|Pandas]] [[https://en.wikipedia.org/wiki/MapReduce|MapReduce]] [[https://hadoop.apache.org/|ApacheHadoop]] [[https://spark.apache.org/|ApacheSpark]]\\
Events: \\

==== Week 5 (Feb 26, Feb 29) ====

**Basic models; parametric models; fitting.** [S. Dayanık] \\
Topic Details: Exploratory data analysis, loess smoother, chi-squared test of independence, linear regression and least squares method, factors and dummy variables, all illustrated on the //Dodgers Advertising and Promotion// case study with R, RStudio, and SQLite.\\
Slides and Additional Material: {{ ::s2024_week05.zip | Week05 Dodgers.zip}} \\
Project/Exercise-Problem-Set/Homework: **Dodgers Project**
  * Include all variables and conduct a full regression analysis of the problem.
  * Write a report for Dodgers management. Discuss your findings in plain English and support them with data analysis.
  * Submit Rmd and html files in a single zip file to Moodle by **19:00 on Saturday, March 16. Late submissions will not be accepted.**
  * Groups of up to three people are fine. Name the zip file with the Bilkent IDs of the group members.
\\ References: [[https://posit.co/ | Posit (formerly RStudio)]] [[https://www.r-project.org/|R]] [[https://www.sqlite.org/index.html|SQLite]] [[https://r4ds.hadley.nz/|R for Data Science]] [[https://mdsr-book.github.io/mdsr2e/|Modern Data Science with R]]
  * Thomas W. Miller, Modeling Techniques in Predictive Analytics With Python and R: A Guide to Data Science
  * Baumer, Daniel, Kaplan and Horton, Modern Data Science with R, Second Edition
  * Wickham, Cetinkaya-Rundel, and Grolemund, R for Data Science, Second Edition\\
Events: \\

==== Week 6 (Mar 4) ====

**Application to customer choice problems (conjoint analysis)** [S. Dayanık] \\
Topic Details: Part worths, part importance, their estimation from product rankings with multiple regression, new product design with market simulation to increase overall market share.\\
Slides: {{ ::s2024_week06.pdf | Conjoint Analysis and Market Simulation}}\\
Project/Exercise-Problem-Set/Homework: \\
References:
  * B. K. Orme, Getting Started With Conjoint Analysis: Strategies for Product Design and Pricing Research
  * Miller, Marketing Data Science: Modeling Techniques in Predictive Analytics With R and Python\\
Events: Spring Break (Mar 7-8)\\

==== Week 7 (Mar 11, Mar 14) ====

**Authorship problem, text analysis, and topic modeling** [S. Dayanık] \\
Topic Details: Who wrote the Federalist Papers (identification of authorship by means of Bayesian classifiers, kNN) \\
Slides and Additional Material:\\
{{ :s2024_week07_federalist.pdf | Federalist Papers Analysis}} {{ ::s2024_week07_lda_annotated.pdf | Latent Dirichlet Allocation Graphical Model}} \\
Project/Exercise-Problem-Set/Homework:\\
References:
  * Hamilton, Jay, and Madison, The Federalist Papers
  * Mosteller and Wallace, Applied Bayesian and Classical Inference\\
Events: \\

==== Week 8 (Mar 18, Mar 21) ====

**Dimensionality reduction; visualization.** [Aksoy] \\
Topic Details: Feature reduction, feature selection, high-dimensional data visualization.\\
Slides and Additional Material: {{ :ge461_dimensionality.pdf |Dimensionality slides}}, {{ :knaw_t-sne_talk.pptx |t-SNE slides}}\\
Project/Exercise-Problem-Set/Homework: [{{ :ge461_project_dimensionality.pdf |Project}} ({{ :fashion_mnist.zip |data}})] (due 23:59 on April 7, 2024)\\
References: [[https://www.mathworks.com/help/stats/dimensionality-reduction.html|Matlab: dimensionality reduction]], [[https://scikit-learn.org/stable/modules/decomposition.html|Scikit-learn: decomposition]], [[https://scikit-learn.org/stable/auto_examples/index.html#decomposition|Scikit-learn: decomposition examples]], [[https://scikit-learn.org/stable/modules/manifold.html|Scikit-learn: manifold learning]], [[https://www.mathworks.com/discovery/data-visualization.html|Matlab: data visualization]], [[https://matplotlib.org/|Matplotlib: data visualization]], [[https://lvdmaaten.github.io/tsne/|t-SNE]]\\
Events: \\

==== Week 9 (Mar 25, Mar 28) ====

**Unsupervised learning, clustering.** [Aksoy] \\
Topic Details: K-means clustering, mixture models, hierarchical clustering.\\
Slides and Additional Material: {{ :ge461_clustering.pdf |Clustering slides}}\\
Project/Exercise-Problem-Set/Homework: \\
References: [[https://www.mathworks.com/help/stats/cluster-analysis.html|Matlab: cluster analysis]], [[https://scikit-learn.org/stable/modules/clustering.html|Scikit-learn: clustering]], [[https://scikit-learn.org/stable/auto_examples/index.html#clustering|Scikit-learn: clustering examples]]\\
Events: \\

==== Week 10 (Apr 1, Apr 4) ====

**Machine learning; supervised learning; classifiers; deep learning.** [Dündar]\\
Topic Details: Bayesian decision theory, linear discriminants, introduction to neural networks, support vector machines, decision trees.\\
Slides and Additional Material: \\
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: Bilkent Day (April 3)\\

==== Week 11 (Apr 8, Apr 11) ====

**Ramadan Holiday**

==== Week 12 (Apr 15, Apr 18) ====

**Machine learning; supervised learning; classifiers; deep learning.** [Dibeklioğlu] \\
Topic Details: Activation functions, convolutional neural networks, recurrent architectures.\\
Slides and Additional Material: {{ ::ge461_deep_learning_2024s.pdf |}} \\
Project/Exercise-Problem-Set/Homework: [{{ :GE461_project_supervised_learning_2024s.pdf |Project Description}} | {{ :data_supervised_learning_project.zip |Data}}] (due 23:55 on April 27, 2024)\\
References: \\
Events: \\

==== Week 13 (Apr 22, Apr 25) ====

**Machine learning in healthcare.** [Çukur] \\
Topic Details: Healthcare analytics: diagnostics, medical imaging, in-patient care, hospital management, risk analytics, wearables.
Deep learning architectures for medical applications.\\
Slides and Additional Material: {{ ::ge461_ml_in_healthcare.pdf |}}\\
Project/Exercise-Problem-Set/Homework: {{ :ge461_pw13_description.pdf |}} {{ :ge461_pw13_data.zip |}}\\
References: Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, Ch. 11 and 14; Mead, Analog VLSI and Neural Systems, Ch. 4; Bishop, Pattern Recognition and Machine Learning, Ch. 5\\
Events: National Sovereignty and Children's Day (Apr 23)\\

==== Week 14 (Apr 29, May 2) ====

**Data mining; online data stream classification; applications.** [Can] \\
Topic Details: Concept drift, ensemble-based classification, text mining. \\
Slides and Additional Material:\\
Project/Exercise-Problem-Set/Homework:\\
References: \\
Events: Labor and Solidarity Day (May 1)\\

==== Week 15 (May 6, May 9) ====

==== Week 16 (May 13, May 16) ====

**Reinforcement learning; applications.** [Tekin] \\
Topic Details: Applications of Reinforcement Learning, Markov Decision Processes, Value Iteration, Q Learning\\
Slides and Additional Material: {{ :ge461_reinforcementlearning.pdf |}} \\
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: \\

==== Textbooks ====

  * [[https://www.textbook.ds100.org/intro|Principles and Techniques of Data Science - Online]]
  * [[http://shop.oreilly.com/product/0636920023784.do|Python for Data Analysis, by Wes McKinney]]
  * [[http://shop.oreilly.com/product/0636920028529.do|Doing Data Science, by Cathy O’Neil and Rachel Schutt. O’Reilly. 2014.]]
  * [[https://www.oreilly.com/library/view/data-science-from/9781492041122/|Data Science from Scratch, second edition, O'Reilly, 2019.]]
  * [[http://shop.oreilly.com/product/0636920034919.do|Python Data Science Handbook, O'Reilly, 2016.]]
  * [[https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=pd_sim_14_2?_encoding=UTF8&pd_rd_i=1491962291&pd_rd_r=a661bb45-d0b9-11e8-9fea-e722222b4194&pd_rd_w=hDZAL&pd_rd_wg=TW8F8&pf_rd_i=desktop-dp-sims&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=18bb0b78-4200-49b9-ac91-f141d61a1780&pf_rd_r=5H464VA0VJ0JFK1QFQXJ&pf_rd_s=desktop-dp-sims&pf_rd_t=40701&psc=1&refRID=5H464VA0VJ0JFK1QFQXJ| Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O'Reilly, 2017.]]
  * [[https://www.cs.ubc.ca/~murphyk/MLbook/|Machine Learning: a Probabilistic Perspective]]
  * [[https://www-bcf.usc.edu/~gareth/ISL/|An Introduction to Statistical Learning, R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.]]
  * [[https://www.springer.com/gp/book/9780387310732|Pattern Recognition and Machine Learning, Christopher Bishop]]
  * [[https://www.amazon.com/Neural-Networks-Learning-Machines-3rd/dp/0131471392|Neural Networks and Learning Machines]]
  * [[https://rafalab.github.io/dsbook/|Introduction to Data Science - Data Analysis and Prediction Algorithms with R, Rafael A. Irizarry. Online Book.]]
  * [[http://www.mmds.org/|Mining Massive Datasets, third edition, Ullman et al., 2020.]]

==== Similar / Complementary Courses ====

  * [[https://bcourses.berkeley.edu/courses/1267848|CS194, Introduction to Data Science, Berkeley]]
  * [[http://data8.org/ | Data 8: Introduction to Data Science, Berkeley]]
  * [[http://www.ds100.org/ | Data 100: Principles and Techniques of Data Science, Berkeley]]
  * [[https://www.cs.purdue.edu/homes/neville/courses/CS24200.html|CS24200, Introduction to Data Science, Purdue]]
  * [[https://web.stanford.edu/class/stats101/|Data Science 101, Stanford]]
  * [[http://cs109.github.io/2015/index.html|CS109, Data Science, Harvard]]
  * [[https://www.cs.umd.edu/class/spring2017/cmsc320/|Introduction to Data Science I, Maryland]]
  * [[http://users.umiacs.umd.edu/~hcorrada/IntroDataSci/syllabus.html|Introduction to Data Science II, Maryland]]
  * [[https://www.eecs.wsu.edu/~assefaw/CptS483-06/|Introduction to Data Science, WSU]]
  * [[https://www.conted.ox.ac.uk/courses/applied-data-science|An Overview of Data Science, Oxford]]
  * [[https://www.cambridgenetwork.co.uk/events/applied-data-science-course-become-a-data-scientist-in-6-months/|Applied Data Science, Cambridge]]
  * [[https://studiegids.tudelft.nl/a101_displayCourse.do?restoreContext=true&SIS_SwitchLang=en&course_id=41759|Data Analysis, Delft]]
  * [[https://www.studocu.com/en/course/technische-universiteit-delft/programming-and-data-science-for-the-99/77128|Programming and Data Science for 99 Percent, Delft]]
  * [[https://datasciencedegree.wisconsin.edu/data-science-700-foundations-of-data-science/|Foundations of Data Science, Wisconsin]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/CS/464/|Introduction to Machine Learning, Bilkent]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/EEE/443/EE_BS/| Neural Networks, Bilkent]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/EEE/485/EE_BS/| Statistical Learning and Data Analytics, Bilkent]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/IE/451/IE_BS/| Applied Data Analysis, Bilkent]]
  * [[http://www.cs.bilkent.edu.tr/~gunduz/teaching/cs550/|Machine Learning, Bilkent]]
  * [[http://www.cs.bilkent.edu.tr/~saksoy/courses/cs551/|Pattern Recognition, Bilkent]]
  * [[http://cs.brown.edu/courses/csci1951-a/|CS1951A Data Science, Brown]]
  * [[http://www.datasciencecourse.org/|CMU15-388/688 Practical Data Science, CMU]]
  * [[https://www.hse.ru/data/2016/10/06/1087301973/program-869867030-nOE2xyWAyH.pdf| Introduction to Data Science, Russia]]
  * [[https://ci.uky.edu/sis/sites/default/files/syllabi/Syllabus-LIS690-Introduction%20to%20Data%20Science-20160101_1.pdf|LIS690, Introduction to Data Science, Kentucky]]

==== Tools, Libraries, Systems, Languages ====

  * [[https://pandas.pydata.org/|Pandas: Python Data Analysis Library]]
  * [[https://aws.amazon.com/machine-learning/?sc_channel=PS&sc_campaign=acquisition_TR&sc_publisher=google&sc_medium=ACQ-P%7CPS-GO%7CNon-Brand%7CDesktop%7CSU%7CMachine%20Learning%7CMachine%20Learning%7CTR%7CEN%7CText&sc_content=ml_general_bmm&sc_detail=%2Bmachine%20%2Blearning&sc_category=Machine%20Learning&sc_segment=293640020615&sc_matchtype=b&sc_country=TR&s_kwcid=AL!4422!3!293640020615!b!!g!!%2Bmachine%20%2Blearning&ef_id=W4g8FAAAAM1g4jhU:20181015203314:s|Machine Learning on AWS]]
  * [[https://spark.apache.org/|Apache Spark]]