Data description with graphs and tables. Basic statistical measures for describing data. Data preparation. The importance of data pre-processing and cleaning. Introduction to databases. SQL. Introduction to supervised learning: decision trees, logistic regression. Introduction to regression: multiple linear regression. Forecasting. Improving a model. The problem of overfitting. Model performance evaluation. Dimensionality reduction. Feature selection. Principal Component Analysis with SVD. Unsupervised learning: clustering. The k-means algorithm. Application of hierarchical clustering models. Semi-supervised learning. Introduction to metadata and Big Data. Computational methods for large data (Hadoop and MapReduce).
Laboratory: (i) Introduction to the R language for Data Science. (ii) Creating, selecting and comparing categorical data using factors. Storing datasets in data frames. Selecting data from a data frame and converting it to a table. (iii) Basic graphics and visualization packages in R. (iv) Functions, loops and flow control. (v) Introduction to SQL. Queries. Queries on multiple tables with JOIN. Subqueries. (vi) Rattle. (vii) R Hadoop.