GE461: Introduction to Data Science - Spring 2024

Introduction to data science fundamentals, techniques and applications; data collection, preparation, storage and querying; parametric models for data; models and methods for fitting, analysis, evaluation, and validation; dimensionality reduction, visualization; various learning methods, classifiers, clustering, data and text mining; applications in diverse domains such as business, medicine, social networks, computer vision; breadth knowledge on topics and hands-on experience through projects and computer assignments. STARS Syllabus

Prerequisites: (CS 101 or CS114 or CS 115) and (MATH 230 or MATH 255 or MATH 260) and (MATH 225 or MATH 241 or MATH 220)
Credits: 3

Course Management Systems: Moodle
Course Website: http://www.cs.bilkent.edu.tr/~ge461/2024Spring

Instructor Team

  • S. Aksoy, C. Alkan, S. Arashloo, F. Can, E. Çiçek, T. Çukur, S. Dayanık, H. Dibeklioğlu, A. Dündar, İ. Körpeoğlu, C. Tekin, E. Tüzün
  • Course Coordinator (contact point): S. Aksoy (saksoy AT cs.bilkent.edu.tr)

TAs

  • Ali Azak (ali.azak AT bilkent.edu.tr)
  • Hakan Gökçesu (hgokcesu AT ee.bilkent.edu.tr)

Classroom and Hours

  • Classroom: EE-317
  • Class hours:
    • Mon 13:30-15:20
    • Thu 08:30-10:20

Grading Policy

  • Final: 40 %
  • Projects: 60 %. Multiple computer/programming/exercise assignments of various sizes.
  • There will be 5 projects. Each project is 12 %.
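
The weights above can be combined as a simple weighted sum. The following is a minimal sketch, assuming every score is reported on a 0-100 scale (the policy does not state the scale); the function name and the sample scores are hypothetical.

```python
# Weights taken from the grading policy: final 40 %, five projects at 12 % each.
FINAL_WEIGHT = 0.40
PROJECT_WEIGHT = 0.12
NUM_PROJECTS = 5


def overall_grade(final_score, project_scores):
    """Return the weighted course grade.

    Assumes all scores share a 0-100 scale (an assumption, not part of the policy).
    """
    if len(project_scores) != NUM_PROJECTS:
        raise ValueError(f"expected {NUM_PROJECTS} project scores")
    return FINAL_WEIGHT * final_score + PROJECT_WEIGHT * sum(project_scores)


# Hypothetical scores, for illustration only.
example = overall_grade(70, [80, 90, 100, 60, 70])
print(round(example, 2))
```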

Attendance

  • Attendance is mandatory. A student who misses more than 9 hours will fail the course automatically.

Exam

  • TBD

Projects

  • Multiple computer/programming/exercise assignments of various sizes.
  • A project can be assigned earlier than the indicated date on the weekly plan.
  • Projects can be individual or group based. Instructors will decide.
  • Projects will be uploaded to Moodle.
  • Programming languages like Python, Java, R or Matlab can be used in the projects.
  • Gaining hands-on experience and experimenting will be important. Real-world data sets can be used (economic/financial data sets, medical/biological data sets, image/video data sets, social network data sets, IT data sets, etc.).

Other

  • Grades will be posted in SAPS.
  • There is no mandatory textbook for the course.

Week 1 (Jan 29, Feb 1)

Introduction; what is data science; data science applications. [Çiçek, Tüzün]
Topic Details: Introductory concepts in data science and applications. Overview of data science process.
Slides and Additional Material:
Topic Details: Software engineering applications.
Slides and Additional Material: ge_461_-_lecture_1_-_course_information_compressed.pdf
Project/Exercise-Problem-Set/Homework: None this week.
References:
Events:

Week 2 (Feb 5, Feb 8)

Data science applications; data science pipeline. [Alkan, Dibeklioğlu]
Topic Details: Genomics applications.
Slides and Additional Material:
Topic Details: Computer vision applications.
Slides and Additional Material: ge461_applications_vision_2024s.pdf
Project/Exercise-Problem-Set/Homework: None this week.
References: "Big Data: Astronomical or Genomical?", Stephens et al., 2015
Events:

Week 3 (Feb 12, Feb 15)

Data representation; preprocessing; preparation; crowdsourcing. [Arashloo, Çiçek]
Topic Details: Normalization, Noise Removal (Filtering), Anomaly Detection, Data Compression, Noise Removal (ICA).
Slides and Additional Material: 2024_pre-processing.pdf
Topic Details: Crowdsourcing applications and usage in data science.
Slides and Additional Material: ge461-crowdsourcing.pdf
Project/Exercise-Problem-Set/Homework: None this week.
Events:

Week 4 (Feb 19, Feb 22)

Data collection; storage; querying; SQL, NoSQL; cloud; distributed storage and computing. [Körpeoğlu]
Topic Details: RDBMS, SQL; SQLite, Pandas; NoSQL; MapReduce and Hadoop; Spark.
Slides and Additional Material: Slides
Project/Exercise-Problem-Set/Homework:
References: SQLite, Pandas, MapReduce, Apache Hadoop, Apache Spark
Events:
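
As a small illustration of the SQL and SQLite topics of Week 4, the sketch below builds an in-memory SQLite database with Python's standard `sqlite3` module and runs one aggregate query. The table name, columns, and rows are made up for this example and do not come from the course material.

```python
import sqlite3

# In-memory database with a made-up table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("north", 300),
        ("south", 520),
        ("south", 410),
        ("east", 450),
    ],
)

# Aggregate with SQL: mean amount per region, highest first.
rows = conn.execute(
    "SELECT region, AVG(amount) AS mean_amount "
    "FROM sales GROUP BY region ORDER BY mean_amount DESC"
).fetchall()
conn.close()

for region, mean_amount in rows:
    print(region, mean_amount)
```

If Pandas is installed, the same query could be loaded into a DataFrame with `pandas.read_sql_query(sql, conn)` while the connection is still open.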

Week 5 (Feb 26, Feb 29)

Basic models; parametric models; fitting. [S. Dayanık]
Topic Details: Exploratory data analysis, loess smoother, chi-squared test of independence, linear regression and least squares method, factors and dummy variables, all illustrated on the Dodgers Advertising and Promotion case study with R, RStudio, and SQLite.
Slides and Additional Material: Week05 Dodgers.zip
Project/Exercise-Problem-Set/Homework: Dodgers Project

  • Include all variables and conduct a full regression analysis of the problem.
  • Write a report for Dodgers management. Discuss your findings in plain English and support them with data analysis.
  • Submit Rmd and html files in a single zip file to Moodle by 19:00 on Saturday, March 16. Late submissions will not be accepted.
  • Groups of up to three people are fine. Name the zip file with the Bilkent IDs of the group members.

References: Posit (formerly RStudio), R, SQLite, R for Data Science, Modern Data Science with R

  • Thomas W. Miller, Modeling Techniques in Predictive Analytics With Python and R: A Guide to Data Science
  • Baumer, Daniel, Kaplan and Horton, Modern Data Science with R, Second Edition
  • Wickham, Cetinkaya-Rundel, and Grolemund, R for Data Science, Second Edition

Events:
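
The Week 5 topics of least squares regression with factors coded as dummy variables can be sketched as follows. The course case study uses R; this is only a rough Python/NumPy illustration on synthetic data, and every variable name, coefficient, and data value below is an assumption made for the example, not taken from the Dodgers data set.

```python
import numpy as np

# Synthetic data, illustrative only: a response driven by one numeric
# predictor and a three-level factor "day_type".
rng = np.random.default_rng(0)
n = 200
temp = rng.uniform(10, 35, size=n)
day_type = rng.integers(0, 3, size=n)  # 0 = baseline level, 1 and 2 = other levels

# Dummy coding: one 0/1 column per non-baseline level of the factor.
d_level1 = (day_type == 1).astype(float)
d_level2 = (day_type == 2).astype(float)

# Design matrix columns: intercept, temp, dummy for level 1, dummy for level 2.
true_beta = np.array([30.0, 0.2, 5.0, 3.0])
X = np.column_stack([np.ones(n), temp, d_level1, d_level2])
y = X @ true_beta + rng.normal(0, 0.5, size=n)

# Least squares: choose b to minimize ||y - X b||^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

Each dummy coefficient is read as the expected shift in the response for that level relative to the baseline level, holding the numeric predictor fixed.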

Week 6 (Mar 4)

Application to customer choice problems (conjoint analysis) [S. Dayanık]
Topic Details: Part worths, part importance, their estimations from product rankings with multiple regression, new product design with market simulation to increase overall market share.
Slides: Conjoint Analysis and Market Simulation
Project/Exercise-Problem-Set/Homework:
References:

  • B. K. Orme, Getting Started With Conjoint Analysis: Strategies for Product Design and Pricing Research
  • Miller, Marketing Data Science: Modeling Techniques in Predictive Analytics With R and Python

Events: Spring Break (Mar 7-8)

Week 7 (Mar 11, Mar 14)

Authorship problem, text analysis, and topic modeling [S. Dayanık]
Topic Details: Who wrote the Federalist Papers? (Identification of authorship by means of Bayesian classifiers and kNN.)
Slides and Additional Material:
Federalist Papers Analysis, Latent Dirichlet Allocation Graphical Model
Project/Exercise-Problem-Set/Homework:
References:

  • Hamilton, Jay, and Madison, The Federalist Papers
  • Mosteller and Wallace, Applied Bayesian and Classical Inference

Events:

Week 8 (Mar 18, Mar 21)

Dimensionality reduction; visualization. [Aksoy]
Topic Details: Feature reduction, feature selection, high-dimensional data visualization.
Slides and Additional Material: Dimensionality slides, t-SNE slides
Project/Exercise-Problem-Set/Homework: [Project (data)] (due 23:59 on April 7, 2024)
References: Matlab: dimensionality reduction, Scikit-learn: decomposition, Scikit-learn: decomposition examples, Scikit-learn: manifold learning, Matlab: data visualization, Matplotlib: data visualization, t-SNE
Events:

Week 9 (Mar 25, Mar 28)

Unsupervised learning, clustering. [Aksoy]
Topic Details: K-means clustering, mixture models, hierarchical clustering.
Slides and Additional Material: Clustering slides
Project/Exercise-Problem-Set/Homework:
References: Matlab: cluster analysis, Scikit-learn: clustering, Scikit-learn: clustering examples
Events:
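
K-means clustering, listed among the Week 9 topics, alternates between assigning points to the nearest centroid and recomputing each centroid as its cluster mean. Below is a minimal NumPy sketch of that loop (Lloyd's algorithm) on synthetic data; the function name, the initialization choice, and the data are assumptions for illustration and are not taken from the course slides.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (keep the old centroid if a cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two well-separated synthetic blobs, illustrative only.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans(X, k=2)
print(np.round(centroids, 1))
```

The result depends on the initial centroids, so in practice the algorithm is often run from several random starts and the solution with the lowest within-cluster distance is kept.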

Week 10 (Apr 1, Apr 4)

Machine learning; supervised learning; classifiers; deep learning. [Dündar]
Topic Details: Bayesian decision theory, linear discriminants, introduction to neural networks, support vector machines, decision trees.
Slides and Additional Material:
Project/Exercise-Problem-Set/Homework:
References:
Events: Bilkent Day (April 3)

Week 11 (Apr 8, Apr 11)

Ramadan Holiday

Week 12 (Apr 15, Apr 18)

Machine learning; supervised learning; classifiers; deep learning. [Dibeklioğlu]
Topic Details: Activation functions, convolutional neural networks, recurrent architectures.
Slides and Additional Material: ge461_deep_learning_2024s.pdf
Project/Exercise-Problem-Set/Homework: [Project Description | Data] (due 23:55 on April 27, 2024)
References:
Events:

Week 13 (Apr 22, Apr 25)

Machine learning in healthcare. [Çukur]
Topic Details: Healthcare analytics: diagnostics, medical imaging, in-patient care, hospital management, risk analytics, wearables. Deep learning architectures for medical applications.
Slides and Additional Material: ge461_ml_in_healthcare.pdf
Project/Exercise-Problem-Set/Homework: ge461_pw13_description.pdf ge461_pw13_data.zip
References: Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, Ch. 11 and 14; Mead, Analog VLSI and Neural Systems, Ch. 4; Bishop, Pattern Recognition and Machine Learning, Ch. 5
Events: National Sovereignty and Children's Day (Apr 23)

Week 14 (Apr 29, May 2)

Data mining; online data stream classification; applications. [Can]
Topic Details: Concept drift, ensemble-based classification, text mining.
Slides and Additional Material:
Project/Exercise-Problem-Set/Homework:
References:
Events: Labor and Solidarity Day (May 1)

Week 15 (May 6, May 9)

Week 16 (May 13, May 16)

Reinforcement learning; applications. [Tekin]
Topic Details: Applications of reinforcement learning, Markov decision processes, value iteration, Q-learning.
Slides and Additional Material:
Project/Exercise-Problem-Set/Homework:
References:
Events:
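
Value iteration, one of the Week 16 topics, repeatedly applies the Bellman optimality update until the state values stop changing. The sketch below runs it on a tiny deterministic chain MDP; the states, rewards, and discount factor are hypothetical and chosen only to keep the example small.

```python
# Hypothetical 4-state chain MDP: states 0..3, actions "left" and "right".
# Entering state 3 yields reward 1 and ends the episode; all other moves give 0.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
TERMINAL = 3
GAMMA = 0.9


def step(state, action):
    """Deterministic transition model: returns (next_state, reward)."""
    nxt = min(state + 1, TERMINAL) if action == "right" else max(state - 1, 0)
    reward = 1.0 if nxt == TERMINAL else 0.0
    return nxt, reward


def value_iteration(tol=1e-8):
    """Apply the Bellman optimality update in place until values converge."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s == TERMINAL:
                continue  # the terminal state keeps value 0
            best = max(r + GAMMA * V[n] for n, r in (step(s, a) for a in ACTIONS))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


V = value_iteration()
# Greedy policy with respect to the converged values.
policy = {
    s: max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
    for s in STATES
    if s != TERMINAL
}
print(V, policy)
```

Unlike Q-learning, which estimates action values from sampled transitions, this sketch assumes the transition model `step` is fully known.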

Textbooks

Similar / Complementary Courses

Tools, Libraries, Systems, Languages

start.txt · Last modified: 2024/04/23 06:17 by ge461