Welcome to my happy place! I love reading about relevant ideas doing the rounds in today’s data-centric world. On this page, I summarize some of my favourite pieces. I hope you appreciate their beauty as much as I do.


Data analysis write-ups by James Scott

Link to Article: Data analysis write-ups by James Scott (jgscott.github.io)

I spent far too much time during my undergraduate degree trying to make sense of my data analysis projects when I revisited them a couple of months after finishing them. This inability to understand my own work often put me in uncomfortable situations where I could only ramble when people asked me about my projects. The fundamental problem was that I didn’t have a defined system for my analysis reports.

This piece by James Scott is exactly the guide I would have loved to have during those days! It’s short, to the point and, most importantly, actionable! Here are the key takeaways from the piece:

  • Have a good outline for your analysis - Overview, Data and Modelling, Results and Conclusions
  • Be specific when talking about your data and your modelling approach
  • Make an attempt to tie “what you did” to “why you did it”
  • Use tables to depict results neatly
  • Have short sections
  • Don’t include code unless it’s asked for
  • Don’t narrate your thought process, stick to only what is important

So, where to go next?

Well, pick up a data analysis project you have just completed and evaluate its write-up!

Visualizations That Really Work by Scott Berinato

Link to Article: Visualizations That Really Work (hbr.org)

When it comes to data visualization, there are a zillion ways in which you can mess things up. This is especially true when we focus on creating visualizations before thinking about the data we are trying to showcase. In this article, Scott Berinato, author of the bestseller “Good Charts: The HBR Guide to Making Smarter, More Persuasive Data Visualizations”, clearly puts forth ideas to help the reader make effective visualizations. Here are some of the key takeaways from the piece:

  • Decision making in the contemporary world is heavily driven by the ability to learn from data
  • Of late, technological advancements have made it relatively convenient for anyone to visualize data
    • However, the product of convenience is not necessarily good
    • Most people who visualize data tend to focus on the rules of creating charts and presentation
    • However, better visualizations tend to take shape when your eyes are on the message that you intend to deliver
  • Ask two main questions before getting into making those charts
    • Is the information conceptual or data-driven?
      • Conceptual can be thought of as qualitative data
      • Data-driven is quantitative data, i.e. data that can be displayed statistically
    • Am I declaring something or exploring something?
      • When we declare, we are affirming an idea or a finding
      • When we explore, we are allowing our audience to understand what happens when something changes
        • Exploration can be used to test a hypothesis
        • Exploration can also be used to mine data for patterns
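  • Based on your answers to the 2 questions above, your visualization falls into one of 4 types: idea illustration (conceptual, declarative), idea generation (conceptual, exploratory), visual discovery (data-driven, exploratory) and everyday dataviz (data-driven, declarative)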

Source: https://hbr.org/2016/06/visualizations-that-really-work

Roles in a Data Team by Alexey Grigorev

Link to Article: Roles in a Data Team – DataTalks.Club

As the popularity of the term “data scientist” increases by the day, most of us forget that there are many other roles working diligently in the data ecosystem to derive value from data. In this article, the author sheds some light on these other roles with the help of a case scenario. In the author’s view, there are 6 main data roles:

  1. Product managers - Ensure the team is building the right product
  2. Data analysts - Analyze data and figure out patterns
  3. Data scientists - Build models and incorporate them into the product (often as prototypes)
  4. Data engineers - Prep data for the analysts and data scientists
  5. Machine learning engineers - Productionize ML models
  6. MLOps engineers/Site reliability engineers - Enforce DevOps practices and ensure the infrastructure necessary for the product development works well

Wine & Math by Lars Verspohl

Link to Article: Wine & Math: A model pairing (pudding.cool)

If you are a data science enthusiast, this interactive article is a perfect way to begin your journey into data. Personally, I find this article to be a fantastic and beautiful explanation of predictive modeling. It’s not too long, it’s clear and, more importantly, it is born of the author’s attempt to simplify the complicated jargon and math that are usually rampant in explanations of predictive models.

The author also talks about the importance of defining the measure we are trying to predict. I have often overlooked this in the past and dived into fitting a linear model straightaway. From experience, I can vouch that doing so does not put you at any sort of an advantage!

Another powerful idea that is often discussed in the context of modeling is interpreting the relationships between the predictors and the target feature. Sometimes, understanding this relationship proves more useful than the prediction itself. For example, in the wine dataset, if we are able to understand how a parameter (let’s say alcohol content) impacts the overall quality (defined as taste) of the wine, then it would be possible to adjust the alcohol content in a way that the wine tastes better!
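
To make this concrete, here is a minimal sketch of reading a predictor-target relationship off a fitted line. The alcohol and quality numbers are made up purely for illustration (they are not taken from the wine dataset in the article), and `fit_line` is a hypothetical helper that does ordinary least squares with a single predictor.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope = co-variation of x and y divided by the variation of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: alcohol content (%) and a taste score out of 10
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
quality = [5.0, 5.5, 6.0, 6.5, 7.0]

slope, intercept = fit_line(alcohol, quality)
print(f"Each extra 1% of alcohol goes with a {slope:.2f} change in the quality score")
```

The slope only describes an association in the data. Whether changing the alcohol content would actually change the taste is a separate question that the fitted line alone cannot answer.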


A Refresher on Statistical Significance by Amy Gallo

Link to Article: A Refresher on Statistical Significance (hbr.org)

One of the most significant ideas when it comes to data-driven decision making is that of statistical significance (no pun intended)! In a very basic sense, statistical significance helps us understand whether an outcome could simply be the result of pure chance or luck. In this article, the author compiles some very useful thoughts on the topic from her interaction with Tom Redman, a data quality expert and a prolific writer for HBR.

When it comes to running a data-driven business, there are two kinds of significance that are of interest: practical significance refers to whether a particular result is of business relevance, while statistical significance is the confidence that a result was not just a lucky shot.

When a result is not statistically significant, it means we cannot rule out that the result was due to chance, so there is no assurance that it will happen again. This can be translated roughly to “don’t rely on this result to help your business”. The 2 reasons why such a situation could arise are sampling error (the sample you have picked is not representative of the population) and variation in the population (the more the data varies, the less likely it is that your sample represents the population).
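
A small simulation can make the point about sampling error and variation tangible. This is only a sketch under made-up assumptions: the populations are normally distributed with an average of 100, and `sample_mean_spread` is a hypothetical helper that measures how much the sample average wobbles from one sample to the next.

```python
import random
from statistics import pstdev

def sample_mean_spread(population_sd, sample_size, trials=1000, seed=0):
    """Standard deviation of the sample mean across many repeated samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = [
        sum(rng.gauss(100, population_sd) for _ in range(sample_size)) / sample_size
        for _ in range(trials)
    ]
    return pstdev(means)

# Hypothetical populations with the same average (100) but different variation
low_variation = sample_mean_spread(population_sd=5, sample_size=30)
high_variation = sample_mean_spread(population_sd=25, sample_size=30)
bigger_sample = sample_mean_spread(population_sd=25, sample_size=300)

print(f"low variation, n=30:   {low_variation:.2f}")
print(f"high variation, n=30:  {high_variation:.2f}")
print(f"high variation, n=300: {bigger_sample:.2f}")
```

The more the population varies, the more a single sample can mislead you, and a larger sample pulls that wobble back down.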

When a result is not practically significant, it simply means that there is no great advantage in acting upon the result. For example, imagine you have a result showing that increasing your budget for advertising via the good old radio will bring about an increase in sales. This sounds good, right? Actually, the answer is “it depends”. If the increase in sales is low compared to what you are spending on radio advertising, it does not matter whether the relationship between advertising budget and sales is statistically significant. You simply cannot proceed with this, at least if you want to stay in business. As Tom Redman says, “I’m all for using statistics, but always wed it with good judgement”.
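
The radio example comes down to back-of-the-envelope arithmetic. The sketch below uses entirely hypothetical numbers, and `net_gain` is an illustrative helper, not something from the article.

```python
def net_gain(extra_units_sold, profit_per_unit, ad_spend):
    """Profit from the additional sales minus what the campaign cost."""
    return extra_units_sold * profit_per_unit - ad_spend

# Hypothetical radio campaign: the lift in sales is real, but small
gain = net_gain(extra_units_sold=400, profit_per_unit=5.0, ad_spend=10_000)

if gain > 0:
    print(f"Worth acting on: net gain of {gain:,.0f}")
else:
    print(f"Not practically significant: net loss of {-gain:,.0f}")
```

Here the campaign sells 400 more units, yet it loses money, so the result is not worth acting on however statistically solid the lift in sales may be.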

The rest of the article explores these oscillations between statistical and practical significance. Highly recommend this!


Everything We Wish We’d Known About Building Data Products by First Round Review

Link to Article: Everything We Wish We’d Known About Building Data Products

I have never been much of a product development person, but all that changed a few months back when I heard this phenomenal talk on product management by Dave Wascha. As my proclivities lie towards anything and everything data, data products quickly caught my attention. However, I did have a very poor understanding of the concept till I came across this piece in my mailbox.

The article is a compilation by First Round Review of a talk by DJ Patil (Chief Data Scientist of the United States) and Ruslan Belkin (VP Engineering at Salesforce). You may recognize DJ Patil as one of the people credited with coining the term “data scientist”. The speakers used their time to share their most important mistakes and learning points in the context of data products. I summarize my favorite points from the piece here.

First, what’s a data product? Simply put, it’s any product that benefits users with the help of data. Even dashboards! The primary idea to keep in mind when initiating a project to build a data product is to accept the fundamental truth that data is super messy. A way to escape the pit of problems induced by poor data quality is to build simple products first.

The speakers also call for careful thought about whether the product works the way its users want it to. Data products must also be built in an iterative fashion. Each iteration must on its own be a fully fledged product, with performance and usability improving from one iteration to the next. This is a powerful idea, as scaling up a product will work only if the product does not have functional issues at the smaller scale.

Finally, the article provides a product pre-flight checklist that emphasizes these points: the product has to work, the product has to work for the user, the user has to feel safe, the users have to feel in control, and digital users are not constrained by geographic boundaries.


5 Principles of Data Ethics For Business by Catherine Cote

Link to Article: 5 Principles of Data Ethics for Business (hbs.edu)

The use of data to make decisions is an important part of what human society is. But given the gargantuan amounts of data that are collected every second, courtesy of great technological advancement, it has become increasingly difficult to gauge the ethical limitations of a data-driven culture. This piece is a very brief introduction to the main avenues of thought one could take when considering the ethical aspects of data science.

“Data ethics encompasses the moral obligations of gathering, protecting, and using personally identifiable information and how it affects individuals.”

The 5 principles of data ethics are

  • Ownership: Every individual has complete ownership over their personal information
  • Transparency: Subjects who provide their data must be informed about how their data will be collected, stored and used
  • Privacy: Personally Identifiable Information(PII) must be secured and not available for public access
  • Intention: The intentions behind a data collection process or a data analysis process have to be “good”
  • Outcomes: Consider the impacts of your data analysis before beginning any analysis to prevent potential disparate impacts or impacts that are negative to society even though the intentions were good

How AI Can Help Companies Set Prices More Ethically by Mark E. Bergen, Shantanu Dutta, James Guszcza and Mark J. Zbaracki

Link to Article: How AI Can Help Companies Set Prices More Ethically (hbr.org)

Ever wondered how businesses set their prices? I’m not a marketing expert, but I believe the most important consideration in pricing has always been the profit a company can make. This is a somewhat self-centered view, as the company caters only to what helps it and ignores the impact on consumers. In fact, it would not be too wrong to say that companies try their best to take advantage of their consumers’ needs. Don’t believe me? Think of the most common example: pricing at $4.99 instead of $5.00, or the 99-cent effect. This creates a psychological effect on the consumer, who perceives 4.99 as closer to 4 than to 5.

This article explores how this need not be the case moving forward with the advent of AI. The authors emphasize that companies must look beyond data-driven pricing for profit and also include societal factors and values to price in a way that the consumers are also cared for. In order to identify if its pricing strategies are potentially harmful to consumers, a company must ask itself 3 important questions:

  1. What am I selling - and can these prices impede access to essential products?
  2. Who am I selling to - and can these prices harm vulnerable populations?
  3. How am I selling - and can these prices manipulate or take advantage of customers?

If the answer to any of these questions is “yes”, then some steps to price more ethically are as follows:

  1. Pause, step back and get creative about your pricing strategy
  2. Be willing to make compromises (i.e. a cut in profits) and communicate this to stakeholders
  3. Incorporate this filter of ethical practices into your pricing deliberations

Almost every profit-making company now uses data to understand consumer behaviour and set prices that people are willing to pay for a product. Some do it on a large scale while smaller organizations do it on much smaller scales. However, it is now time to think about how pricing strategies can help companies support and protect the consumer base they thrive on.


The Netflix Data War by Roger Peng

Link to Article: The Netflix Data War · Simply Statistics

I watch a lot of shows and movies. In fact, I find these escapades into online media content to be the source of some of my best ideas, including the projects I pursue as a data science enthusiast. In light of this, I have often been interested in what drives decisions to produce and market new content on OTT platforms.

In this article, the author dives a bit into the interactions between the data team and the other teams in a company, using the example in “At Netflix, Who Wins When It’s Hollywood vs. the Algorithm?”. The discussion includes some very relevant questions regarding:

  • The role of both data and intuition in decisions
  • The need for transparency and communication across teams in a company

Personally, I believe it’s important as an “aspiring data scientist” to accept the fact that sometimes data is not the most important component of decision making. In some businesses, intuition and gut feeling might play a far more prominent role than data-driven ideas.


The Ghosts in the Data by Vicki Boykis

Link to Article: The ghosts in the data · Vicki Boykis (veekaybee.github.io)

In this absolutely phenomenal take on data work, the author considers the concept of implicit knowledge in data science. This implicit knowledge or “ghost knowledge” is knowledge about the data in use that is known only to a select set of people. Therefore, it becomes difficult for the less-informed to deal with such data.

Some of these ghosts in the data are:

  • The power law distribution of data or “A few units have the most significant effects”, but you can’t overlook the units with lesser effects
  • Data collection has no universal method; much of it has to be learned from experience and by broadening your understanding of how others do it
  • Data science needs to include best practices of development
  • Working with people can be a really hard challenge
  • Many times, the gut trumps data

Asking good questions is hard (but worth it) by Julia Evans

Link to Article: Asking good questions is hard (but worth it) (jvns.ca)

It’s common to hear people saying “ask questions”. However, if you have tried it, you will have realized it’s easier said than done. In this article, the author takes us on a journey through the task of asking good questions. She shares some very practical insights and offers a great bunch of examples from her life to help shape our perspective.

My favorite points of the piece are:

  • Ask questions that contribute to the discovery of new knowledge
  • Ask questions that summarize your current understanding of the concept
  • Ask questions with a willingness to accept the response
  • Ask questions about the origin of a concept to understand where it all came from
  • Ask questions about how a system would react if perturbed
  • Practice asking good questions in your interactions with people

Data Cleaning IS Analysis, Not Grunt Work by Randy Au

Link to Article: Data Cleaning IS Analysis, Not Grunt Work - Counting Stuff (substack.com)

Data cleaning is usually thought of as the “Ugh! Why god, why?” step of data work. Yet, data cleaning is not taught to the same extent as other data disciplines. It gets very little discussion in the classroom, and the bits that are discussed are often confined to writing code to clean a specific dataset. The problem with such an approach to teaching data cleaning is that it fails to provide a framework or a thought process for dealing with messy data.

In this long piece, the author provides a framework to look at data cleaning as if it were data analysis (and he does strongly argue that cleaning is analysis). Here are my top takeaways from the piece:

  • Data cleaning is not just a systematic removal of “errors” and “problems”; it’s the extraction of signal from noise
  • Both cleaning and analysis are driven by conscious decisions to transform raw data into a form more suitable for extracting useful information
  • Cleaning is an important step to know your data
  • Data cleaning needs to be thought of as building reusable transformations
  • Tips to do better data cleaning
    • Don’t manipulate the original raw data
    • Have a paper trail of analysis
    • Fully document every cleaning decision
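
The tips above can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the author’s code: the records, the step functions (`drop_missing_age`, `clip_age`) and the `clean` helper are all hypothetical. The raw data is copied before anything is changed, each cleaning decision is a small reusable function, and every step leaves an entry in a paper trail.

```python
import copy

def drop_missing_age(rows):
    """Remove rows where age was never recorded."""
    return [r for r in rows if r["age"] is not None]

def clip_age(rows, low=0, high=120):
    """Treat ages outside a plausible range as entry errors and drop them."""
    return [r for r in rows if low <= r["age"] <= high]

def clean(raw_rows, steps):
    """Apply each step to a copy of the raw data and record what it did."""
    rows = copy.deepcopy(raw_rows)  # the original raw data is never modified
    trail = []
    for step in steps:
        before = len(rows)
        rows = step(rows)
        # The docstring of each step doubles as the documented decision
        trail.append(f"{step.__name__}: {before} -> {len(rows)} rows ({step.__doc__})")
    return rows, trail

# Hypothetical raw records
raw = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 430}]

cleaned, trail = clean(raw, [drop_missing_age, clip_age])
for entry in trail:
    print(entry)
```

Because each step is a plain function, the same list of steps can be rerun on next month’s data, and the trail shows exactly which decision removed which rows.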

Why “Many-Model Thinkers” Make Better Decisions by Scott E. Page

Link to Article: Why “Many-Model Thinkers” Make Better Decisions (hbr.org)

Data is usually only as good as the way it is used. According to the author, it is important for organizations to use models to aid decision making, as data by itself can’t recommend a plan of action. A model, as per the author, is “a formal representation of a domain or a process, often using variables and mathematical formulae”.

Some of the advantages of leveraging models in decision-making include the ability to use more data, safety against logical errors, the flexibility that allows models to be tested and calibrated, and the ability to compare different models in order to choose the best representation.

The author then goes on to build his main case that multiple models are better than a single model. This is an intuitive way of looking at real-world problems - having multiple models is representative of having different ways of seeing the same problem.

In order to put the model combination paradigm into practice, the author recommends 3 rules:

  • Choose models that focus attention on different parts of the problem
  • Choose models that boost the performance of previous models
  • Choose models that conflict with other models (useful when the amount of data you have is little)
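
One simple way to combine models is to average their predictions. The sketch below is my own illustration with made-up forecasts, not an example from the article. It also checks a standard identity, sometimes called the diversity prediction theorem: the squared error of the averaged prediction equals the average squared error of the individual models minus the diversity of their predictions.

```python
from statistics import mean

def combine(predictions):
    """Simple model combination: average the individual predictions."""
    return mean(predictions)

# Hypothetical sales forecasts from three different models, and the true value
predictions = [90.0, 120.0, 105.0]
actual = 100.0

combined = combine(predictions)
combined_error = (combined - actual) ** 2
average_error = mean((p - actual) ** 2 for p in predictions)
diversity = mean((p - combined) ** 2 for p in predictions)

print(f"combined prediction:          {combined}")
print(f"squared error of the average: {combined_error}")
print(f"average squared error:        {average_error}")
print(f"diversity of predictions:     {diversity}")
```

With these made-up numbers the averaged forecast is closer to the truth than the typical individual model, and the gap is exactly the diversity of the predictions. That is the arithmetic reason why models that disagree with each other can still be useful together.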

Marvel’s Blockbuster Machine by Spencer Harrison, Arne Carlsen and Miha Škerlavaj

Link to Article: Marvel’s Blockbuster Machine (hbr.org)

This piece is so relevant to me because it describes a way of using data that I have personally never encountered. It uses data analysis to understand the key aspects of what makes Marvel’s movies the blockbuster phenomenon they are.

The authors collected a large amount of data including 243 interviews, 95 video interviews with producers, directors and writers, 140 reviews from leading critics, scripts and visual elements of 20 Marvel Cinematic Universe (MCU) movies. The main methods of analysis employed included qualitative analysis of the interviews, computerized text analysis of scripts and a visual analysis of the images in each movie.

This research contributes to a firm grounding on a couple of core ideas. The first is the obvious rationale behind how the MCU is able to dish out blockbuster after blockbuster. The second idea is about how the principles that help the MCU prosper can help other businesses to improve their operations.

Now, I am a data person. So, before talking about the findings of this research, I would first like to talk about why I think this data-driven research is interesting. The research uses data that is relevant to answering the question at hand. The data also appears to be credible, as it has all apparently been procured from sources that are in the best possible position to give an idea about the franchise. Another characteristic I find interesting is how the authors recognize that each data point is more than just a measurement or a collection of information. While this is not explicitly stated, it can be inferred with ease, as the authors provide background information to help build the narrative around their findings.

The analysis methods used are not limited to a quantitative or a qualitative approach. Instead, they are a combination of the two, which lets the authors perform their research without being held back by a limited analysis toolkit. When it comes to results, the authors found that Marvel adheres to 4 principles when making its films:

  1. Select for experienced inexperience: Hire people who have experience in a field that the MCU has not ventured into. For example, 14 of the 15 directors of the MCU films had no prior experience in making superhero films
  2. Leverage a stable core: Maintain a small percentage of people from one movie to the next. For example, on average 25% of the core creative group overlapped from one movie to the next
  3. Keep challenging the formula: Experiment with fresh ideas. For example... well, do you need an example? It’s super evident to moviegoers that every MCU film offers a box of surprises. Some can be cruel too - I love you 3000
  4. Cultivate customer curiosity: Provoke an interest in characters. For example, the MCU places artifacts that certain movies revolve around in other movies, which helps pique the interest of fans

In conclusion, the research identifies how Marvel’s creative process is not constrained like traditional creative processes. If applied together, these 4 principles can help any business be a “sustainable and ever-renewing innovation engine” (in the words of the authors).