Trying to learn multiple programming languages at once is a real pain. Although many of the concepts and structures are similar, the syntax can be very different, as far as I can tell from my limited experience and knowledge. Currently, I’m tackling C++, R, and Stata, in addition to machine learning via the Weka application.
Despite the headaches and nausea this induces, I’m looking forward to the day when I’ve mastered all these tools and am making a good living doing something gratifying.
It’s been a busy time since my last post: switching jobs, taking a full load of classes, and spending the remaining time with my family. Now that things are settling down into a predictable routine with some time to write, I want to give an update on my path to becoming a data scientist.
I’m kicking off the fall session at the community college, where I’m pursuing a certificate in computer programming, with a class in C++. So far it has been rather easy, even though much of the curriculum is based on lecture notes written by the instructor, who is friendly and funny but not an expert in English. So, we’ll see how that goes.
The latest class I’ve taken in UCSD’s Data Mining certificate program, Data Preparation, was just about as bad as the Data Mining I class. The video lectures were not very instructive: the teacher mostly read the bullet points off the slides, constantly adding what seems to be her favorite expression, “and so on and so forth”, without really explaining anything. On occasion, she can even be heard sighing deeply, as if she were annoyed to be teaching the class. Never mind the fact that all the lessons were pre-recorded and that she never participated in the class’s online discussion board.
Thankfully, it seems that the data gods have taken notice of the suffering of newbie data geeks and have made available a handful of awesome and FREE online learning courses devoted to data mining.
- My latest find is the perfect antidote for the overpriced crap that is UCSD: Data Mining with Weka, by the creator of Weka itself, Prof. Ian Witten. The video lectures are thoughtful, thankfully brief, and incorporate relevant hands-on exercises – all of which have been woefully absent in UCSD’s classes.
- Google’s very own Intro to R video series is not as comprehensive, but it is a fantastic and accessible introduction to R.
- Another one that I haven’t started yet but that looks very promising is Caltech’s Learning from Data course, which teaches the basics of machine learning through the open-source tool Octave, which is like a free version of the popular MATLAB application.
- Finally, another great resource I’ve stumbled across is the online library of tutorials and references from UCLA’s Institute for Digital Research and Education, covering a variety of statistical tools, including R, SAS, SPSS, and Stata.
Data Science 101, by Ryan Swanstrom, a great resource for budding data scientists like me, recently posted a must-read list of all the concepts a data scientist should know. Here’s the list he came up with:
- linear algebra
- basic statistics
- linear and logistic regression
- data mining
- predictive modeling
- cluster analysis
- association rules
- market basket analysis
- decision trees
- time-series analysis
- machine learning
- Bayesian and Monte Carlo statistics
- matrix operations
- text analytics
- principal component analysis
- experimental design
- unsupervised learning
- constrained optimization
Although I’m familiar with some of these from my introductory statistics and data mining courses through UCSD Extension, there’s still a lot to learn – and master.