Mapping Data Made Easy

By the end of the Fusion Tables tutorial, you can create a useful population map of the Bay Area.

Wow. I’m loving Google more and more every day. It seems like they have a free service for just about anything you can imagine. Although I’d heard about Google Fusion Tables before, I hadn’t really looked into it until today.

First, I mapped a couple of addresses and did simple things like changing the color of the markers. Then, I went through this handy tutorial, which showed how to merge tables with publicly available data to create more powerful maps, with polygons colored according to selected variables.
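For anyone curious what that merge-and-color step amounts to, here is a minimal sketch in plain Python. It is not the Fusion Tables interface and not the tutorial's method; the district names, scores, thresholds, and the helper names `merge_on_key` and `color_for` are all made up for illustration. It joins a table of boundaries to a table of values on a shared key column, then picks a fill color per polygon from the joined value.

```python
# Hypothetical sample rows: placeholder names and numbers, not real data.
boundaries = [
    {"district": "District A", "geometry": "<polygon A>"},
    {"district": "District B", "geometry": "<polygon B>"},
    {"district": "District C", "geometry": "<polygon C>"},
]
scores = [
    {"district": "District A", "score": 720},
    {"district": "District B", "score": 845},
]


def merge_on_key(left, right, key):
    """Left-join two lists of row dicts on a shared key column."""
    lookup = {row[key]: row for row in right}
    # Rows in `left` with no match in `right` are kept unchanged.
    return [{**row, **lookup.get(row[key], {})} for row in left]


def color_for(score, buckets=((800, "green"), (700, "yellow"))):
    """Map a score to a fill color; the thresholds are arbitrary assumptions."""
    if score is None:
        return "gray"  # no data for this polygon
    for threshold, color in buckets:
        if score >= threshold:
            return color
    return "red"


merged = merge_on_key(boundaries, scores, "district")
styled = [{**row, "color": color_for(row.get("score"))} for row in merged]

for row in styled:
    print(row["district"], row["color"])
```

The join is a left join on purpose: a polygon with no matching data row still gets drawn, just in a neutral color, rather than disappearing from the map.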

By the end of the day, I had created a map of all the school districts in California. Next step: add metadata like district API scores.


Becoming a Data Scientist

Weka data mining program

It’s been a busy time since my last post: switching jobs, taking a full load of classes, and spending the remaining time with my family. Now that things are settling down into a predictable routine with some time to write, I really wanted to give an update on my path to becoming a data scientist.

I’m kicking off the fall session at the community college, where I’m pursuing a certificate in computer programming, with a class in C++. So far it has been rather easy, even though much of the curriculum is based on lecture notes written by the instructor, who is friendly and funny but not an expert in English. So, we’ll see how that goes.

The latest class I’ve taken in UCSD’s Data Mining certificate program, Data Preparation, was just about as bad as the Data Mining I class. The video lectures were not very instructive, as the teacher mainly resorted to reading the bullet points on the slides and constantly adding her seemingly favorite expression, “and so on and so forth”, without really explaining anything. On occasion, she can even be heard sighing deeply, as if she were annoyed to be teaching the class. Never mind the fact that all the lessons were pre-recorded and that she never participated in the class’s online discussion board.

Thankfully, it seems that the data gods have taken notice of the suffering of newbie data geeks and have made available a handful of awesome and FREE online learning courses devoted to data mining.

  • My latest find is the perfect antidote for the overpriced crap that is UCSD: Data Mining with Weka, by the creator of Weka itself, Prof. Ian Witten. The video lectures are thoughtful, thankfully brief, and incorporate relevant hands-on exercises – all of which have been woefully absent in UCSD’s classes.
  • Not as comprehensive, but a fantastic and accessible introduction to R is Google’s very own Intro to R video series.
  • Another one that I haven’t started yet but that looks very promising is Caltech’s Learning from Data course, which teaches the basics of machine learning through the open-source tool Octave, which is like a free version of the popular MATLAB application.
  • Finally, another great resource I’ve stumbled across is the UCLA Institute for Digital Research and Education’s online library of tutorials and references for a variety of statistical tools, including R, SAS, SPSS, and Stata.

Hot Spots in Big Data

Big Data Opportunities by Industry

The biggest beneficiaries of big data would seem to be retail giants, like Amazon and Walmart. But according to a 2012 study by Gartner, the hottest industries for big data are banking, communications, government, and manufacturing. For details, check out this insightful (but ugly) infographic created with Tableau.

The Results Are In!

Probably the only thing that my $625 Data Mining I course through UCSD Extension was good for was the Discussion Board, where fellow classmates gave a piece of their mind about the class and shared valuable tips. One great lead was this set of poll results from KDnuggets on the most used software tools in the world of data mining and big data.

The top 10 worth learning were:

  1. R
  2. Excel
  3. Rapid-I RapidMiner
  4. KNIME
  5. Weka / Pentaho
  6. StatSoft Statistica
  7. SAS
  8. Rapid-I RapidAnalytics
  10. IBM SPSS Statistics

This survey backs up James Kobielus’s claim in his blog that “open-source communities are where much of the fresh action in data science is happening”, as many of the tools preferred by those in the survey are indeed open-source. That’s great news because I don’t have that much money.