As I’ve noted elsewhere, academic credentials are important but not necessary for high-quality data science. The core aptitudes – curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor, skeptical nature – that distinguish the best data scientists are widely distributed throughout the population.
James Kobielus, Big Data evangelist at IBM, in his blog
You know data analytics has hit the big time when people start analyzing porn. Deep Inside, a fascinating visualization project, gave me a laugh, but its findings were also pretty revealing about the culture and science of sex.
These findings made me laugh and think at the same time, which doesn’t happen often:
- Only 5% of Americans are naturally blonde, yet 32.7% of female porn stars have blonde hair.
- A whopping 70.5% of actresses are Caucasian. What does this say about the audience?
- 90.5% of adult entertainment is produced in my backyard.
Probably the only thing that my $625 Data Mining I course through UCSD Extension was good for was the Discussion Board, where fellow classmates shared their opinions of the class and valuable tips. One great lead was a set of poll results from KDnuggets on the most-used software tools in the world of data mining and big data.
The top 10 worth learning were:
- Rapid-I RapidMiner
- Weka / Pentaho
- StatSoft Statistica
- Rapid-I RapidAnalytics
- IBM SPSS Statistics
This survey backs up James Kobielus’s claim in his blog that “open-source communities are where much of the fresh action in data science is happening”, as many of the tools preferred by those in the survey are indeed open-source. That’s great news because I don’t have that much money.
Data Science 101, by Ryan Swanstrom, a great resource for budding data scientists like me, recently posted a must-read list of all the concepts a data scientist should know. Here’s the list he came up with:
- linear algebra
- basic statistics
- linear and logistic regression
- data mining
- predictive modeling
- cluster analysis
- association rules
- market basket analysis
- decision trees
- time-series analysis
- machine learning
- Bayesian and Monte Carlo statistics
- matrix operations
- text analytics
- principal components analysis
- experimental design
- unsupervised learning
- constrained optimization
Although I’m familiar with some of these from my introductory statistics and data mining courses through UCSD Extension, there’s still a lot to learn and master.