Becoming a Data Scientist

It’s been a busy time since my last post: I’ve been switching jobs, taking a full load of classes, and spending the remaining time with my family. Now that things are settling into a predictable routine with some time to write, I want to give an update on my path to becoming a data scientist.

I’m kicking off the fall session at the community college, where I’m pursuing a certificate in computer programming, with a class in C++. So far it has been rather easy, even though much of the curriculum is based on lecture notes written by the instructor, who is friendly and funny but not an expert in English. So, we’ll see how that goes.

The latest class I’ve taken in UCSD’s Data Mining certificate program, Data Preparation, was just about as bad as the Data Mining I class. The video lectures were not very instructive: the teacher mostly read the bullet points off the slides, constantly adding her seemingly favorite expression, “and so on and so forth”, without really explaining anything. On occasion, she could even be heard sighing deeply, as if she were annoyed to be teaching the class. Never mind that all the lessons were pre-recorded and that she never participated in the class’s online discussion board.

Thankfully, it seems that the data gods have taken notice of the suffering of newbie data geeks and have made available a handful of awesome and FREE online learning courses devoted to data mining.

  • My latest find is the perfect antidote for the overpriced crap that is UCSD: Data Mining with Weka, by the creator of Weka itself, Prof. Ian Witten. The video lectures are thoughtful, thankfully brief, and incorporate relevant hands-on exercises – all of which have been woefully absent in UCSD’s classes.
  • Not as comprehensive, but a fantastic and accessible introduction to R is Google’s very own Intro to R video series.
  • Another one that I haven’t started yet but that looks very promising is Caltech’s Learning from Data course, which teaches the basics of machine learning through the open-source tool Octave, which is like a free version of the popular MATLAB application.
  • Finally, another great resource I’ve stumbled across is the UCLA Institute for Digital Research and Education’s online library of tutorials and references for a variety of statistical tools, including R, SAS, SPSS, and Stata.

You Are What You Like

It was just a matter of time until someone realized that Facebook is a treasure trove of psycho-social data. When people are active on Facebook, liking this and that, their activity can paint a pretty interesting picture of their personality. With each press of the Like button, I’m aware that I’m contributing to a highly revealing database of my predilections – my political leanings, my consumer interests, and what kind of news stories grab my attention.

WolframAlpha was one of the first to offer a social analytics tool for regular folk to mine their own data, but it mostly provides summary statistics. Now, real scientists from the University of Cambridge have uncovered deeper and more meaningful conclusions from a person’s Likes. I was pretty surprised at how accurate their tool was for me. Try it out for yourself.

As I’ve noted e…

As I’ve noted elsewhere, academic credentials are important but not necessary for high-quality data science. The core aptitudes – curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor, skeptical nature – that distinguish the best data scientists are widely distributed throughout the population.

James Kobielus, Big Data evangelist at IBM, in his blog