Data And Machine Learning: All You Should Know

This is a book recommendation if you're thinking of going out there on your first date with data and Machine Learning.

My First Date

I almost didn't graduate (on time) just because I always thought that pushing the envelope and finding innovative ways of making one's life more difficult than it should be is exciting. I walked into the Data Science (Informatics back then) department at college and asked if I could do something for my thesis that nobody does. It was perfect timing because an article (that was before Google and the internet) had just come out with a long story on something called Machine Learning (ML), specifically Artificial Neural Networks. My thesis chair handed me the article and suggested that I build a Neural Network that learns how to add two numbers together.

That was 25 years ago. Writing the algorithm in C++ took months, and debugging it took weeks, on an IBM 360 computer that took up half of our dorm room. The program then ran for weeks, continuously learning on its own. (It was a multi-layered neural network using a back-propagation algorithm with a sigmoid function [1].) From time to time, the network collapsed. I adjusted it. If I was too aggressive, it learned very fast but it exploded, and I had to start it from scratch. If I was too cautious, it learned way too slowly for me to graduate that year. In the end, it worked out. The whole thing fit on a floppy disk.
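Here is a minimal sketch of that "too aggressive versus too cautious" trade-off, assuming the knob being adjusted was the learning rate (the story above doesn't name it). This is not the original multi-layered sigmoid network: to keep it short, it swaps in a single linear neuron trained with plain gradient descent on three made-up input pairs. The function name train_adder and the three learning rates are illustrative choices only.

```python
def train_adder(learning_rate, steps=50):
    """Fit y = w1*a + w2*b to the target a + b with full-batch gradient descent.

    Returns the learned weights and the final mean squared error.
    """
    # Toy inputs (a, b); made up for illustration, not data from the thesis.
    samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    n = len(samples)
    w1, w2 = 0.0, 0.0  # start from scratch

    for _ in range(steps):
        grad1 = grad2 = 0.0
        for a, b in samples:
            error = (w1 * a + w2 * b) - (a + b)
            # Gradient of the mean squared error with respect to each weight.
            grad1 += 2 * error * a / n
            grad2 += 2 * error * b / n
        w1 -= learning_rate * grad1
        w2 -= learning_rate * grad2

    loss = sum(((w1 * a + w2 * b) - (a + b)) ** 2 for a, b in samples) / n
    return w1, w2, loss


if __name__ == "__main__":
    # Assumed learning rates: one reasonable, one too aggressive, one too cautious.
    for label, rate in [("reasonable", 0.4), ("too aggressive", 1.5), ("too cautious", 0.001)]:
        w1, w2, loss = train_adder(rate)
        print(f"{label:>15}: w1={w1:.3g}, w2={w2:.3g}, loss={loss:.3g}")
```

With the reasonable rate, both weights settle near 1, which is exactly "add the two numbers." With the aggressive rate, every step overshoots and the error grows instead of shrinking. With the cautious rate, the weights barely move in 50 steps.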

Today, ML Is At Your Fingertips

Today, these months would take minutes. Computers are monsters, algorithms are ready to use with a single line of code, and communities like kaggle.com provide data, support, your own kernels, and training. Throw in a couple of lines of Python, and you feel like you have superpowers. However, this superpower comes with some danger. Before Machine Learning, before anything that is sexy today about the mysterious Machine Learning or Artificial Intelligence, there is something you CANNOT skip: basic data literacy. It all starts with boring data.

Can Data Be Engaging? First Date Can Be Awkward...

Running algorithms such as neural networks, random forest, k-NN, multi-armed bandit, etc. takes no time today. It's like speed dating. Do you have a good question? They all come back with an answer. The problem is, this answer might not be the answer to your question. Think of these algorithms as experts specialized in looking at the world in a very specific way. They're extremely efficient at what they're doing but they (at least most of them) interpret the world very differently. The exact same world. These algorithms interpret the world through data. If you provide them the type and amount of data they need, they can do magic. But don't get carried away with their beauty. The first date can be awkward. You need to know their limitations, their expertise, and their "data language."

What's The Difference Between White And Red Wine?

I love it when the different worlds of my passions come together. My father used to grow grapes for wine, and I had to work in the vineyard as a kid. You can call this first-hand data collection about winemaking. One conclusion I came to without any ML is that making wine is WAY more work than buying a good bottle in the store. But I learned something that some of you may not know:

What's the difference between white and red wine?

Let's say you want to build an algorithm that is able to distinguish between red and white wine. Without knowing anything about the industry, you might think naturally that white wine comes from white grapes, and that red wine comes from red grapes. Therefore, if you can grab the data of the type of grape used for a particular wine, you don't really need any fancy algorithm to come up with the answer.

But it's not as easy as it sounds. Truth is, you can make white wine from red grapes (although you can't make red wine from white grapes). In Numsense! Data Science for the Layman: No Math Added [2], the authors explain the difference well:

The key difference lies in the way the grapes are fermented. For red wine, grape juice is fermented together with grape skins, which exude the distinctive red pigments, while white wine is fermented juice without the skin.

This book is a gold mine for those who want to go on a date with data and Machine Learning but are not sure about the first date. The book does what the title promises: it makes sense of data for laypeople without too much math involved. Back to the wine example. Understanding that the color comes from the grape skin during fermentation leads to a crucial data point: when the grape skin is part of the fermentation (and not just the juice, as in white wine), chemicals that are present in the skin will be present in the wine as well. That's the reason why white wine often has added sulfur dioxide for preservation, while red grape skin already has natural preservatives in it. (These chemicals can cause headaches!) Now we can use a different data set, one taken from the wine itself: its chemical compounds.

k-NN To The Rescue!

Feeding that information into one of the Machine Learning (ML) algorithms, k-NN, with an ideal fit, we can predict wine color with over 98% accuracy.

98% sounds pretty accurate! Now, remember how my network exploded from time to time? ML algorithms come with limitations. They depend not only on the size and quality of your data set but also on their fine-tuning. For example, k-NN (which is one of the simplest algorithms) uses nearest neighbors (NN) to predict an outcome. But there's this "k" that determines how many neighbors the algorithm should look at. If you set k too small, it looks only at "the neighbor right next door," which may amplify errors in the data due to random sampling. If you set k too large, it includes "neighbors from nearby counties," which may dilute any patterns you see by involving too many neighbors in the prediction. (In ML language, the former is called "overfitting" and the latter "underfitting.")
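To see the effect of "k" in action, here is a minimal k-NN sketch in plain Python. Everything in it is an assumption for illustration: the two numeric features stand in for some pair of chemical measurements, the nine data points are invented (including one deliberately mislabeled wine), and the function name knn_predict is made up. It is not the data set or the tuning behind the 98% figure above.

```python
import math
from collections import Counter

# Invented toy data: ((feature_1, feature_2), label).
# The features are hypothetical stand-ins for two chemical measurements.
train = [
    ((1.0, 1.0), "white"),
    ((1.2, 0.8), "white"),
    ((0.8, 1.1), "white"),
    ((1.1, 1.3), "white"),
    ((0.9, 0.9), "white"),
    ((3.0, 3.0), "red"),
    ((3.2, 2.9), "red"),
    ((2.9, 3.3), "red"),
    ((1.05, 1.05), "red"),  # deliberately mislabeled point (noise in the data)
]


def knn_predict(train_points, query, k):
    """Return the majority label among the k training points closest to query."""
    neighbors = sorted(train_points, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    near_noise = (1.04, 1.04)   # sits inside the white cluster, next to the noisy point
    in_red_cluster = (3.0, 3.1)  # sits inside the red cluster

    for k in (1, 3, len(train)):
        print(
            f"k={k}: near_noise -> {knn_predict(train, near_noise, k)}, "
            f"in_red_cluster -> {knn_predict(train, in_red_cluster, k)}"
        )
```

With k=1, the query next to the mislabeled point copies its wrong label: the neighbor right next door wins. With k=3, the two genuine white neighbors outvote it. With k equal to the whole data set, every query gets the overall majority label, so even a wine deep in the red cluster comes back as white.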

Conclusion

There's a lot more to Data Science and Data Analytics that we must be aware of before throwing ML algorithms at our data sets, especially when it comes to decision making. Playing around and practicing on kaggle.com is a good start, but don't make the mistake of jumping to "rapid development" when your decisions will actually impact real business goals and real people. You will get an answer from the ML algorithm; it just might not be the answer to your question.

(Engaging) Data In Learning

Nothing engages people more than relevance to their authentic needs. Using data beyond learning activities to provide insights into how best to help them grow is where engagement starts. Data is everywhere. Build relationships, build data sets, build insights. But remember, there's a bigger WOR-L&D around L&D. Let's engage the WORLD&D!

References:

[1] Multi-Layer Neural Networks with Sigmoid Function— Deep Learning for Rookies (2)

[2] Numsense! Data Science for the Layman: No Math Added by Annalyn Ng and Kenneth Soo

Originally published on July 1, 2019

Originally published at www.linkedin.com