66 Days of Data

This page contains my daily bit of data science learning for #66DaysofData from January 5th, 2021.

Day 1 - 5th January, 2021

Read all articles in the #DataDecember Initiative
Completed Module 1 of BCG Open-Access Data Science & Advanced Analytics Virtual Experience Program
- Designed a data science problem based on a given hypothesis

Day 2 - 6th January, 2021

Binged through Ken Jee’s Sports Analytics videos and content
- The 4 Types of Sports Analytics Projects - YouTube
- How YOU Can Land a Sports Analytics Job - YouTube
- Official Website: Playing Numbers - Sports Analytics Content for those Interested in Playing the Numbers
- Conferences and Blogs
  - MIT Sloan Sports Analytics Conference - Sports Analytics, Business, & Technology. (sloansportsconference.com)
  - The Harvard Sports Analysis Collective - The official blog of the Harvard Sports Analysis Collective
- Ken Jee’s Tips to get a job in Sports Analytics
  - Read at least 2-3 articles per day
  - Learn skills and tools
  - Engage with community
  - Work for university teams
  - Do projects
  - Produce content
  - Reach out
Scraped a website on bribes paid in India (Site: http://ipaidabribe.com) using the rvest package in R and collected 1000 reports of bribes from the last year
- 1000 is too few. However, with my assignments, I don’t currently have time to run the scraper for longer and collect more data
- Potential project : Analyze bribes in India
  - Which department takes the most bribes?
  - Which state collects the most?
  - What is the sentiment of people writing the posts?
  - Bonus: Is the initiative by this website being used correctly?

Day 3 - 7th January, 2021

Worked with the rtweet package to collect tweets from Twitter
- Used this to work on my winter assessment
- Resource used: https://mkearney.github.io/nicar_tworkshop

Day 4 - 8th January, 2021

Had a perfunctory read over the winning solutions to the Hateful Memes Challenge hosted on DrivenData
Listed out potential projects that I can work on during the next few weeks as a part of #66DaysofData
- Why?
  - It would be useless to do #66DaysofData if I don’t put it to practical use
  - Will help me improve my portfolio
- What types of projects?
  - The plan is to work on projects that involve at least one of the following concepts
    - Intensive data cleaning
    - Storytelling with data
    - Predictive analytics - Tabular data
    - Natural Language Processing
    - Computer Vision
    - SQL
    - End-to-end deployment or Building an ML system

Day 5 - 9th January, 2021

Brushed up SQL Basics
- SQL Cheatsheet
- SQL Order of Operations
- What project can I do to ensure I understand SQL better?
  - A web app that uses SQL? (But, I don’t like web development much)
  - A Python - SQL application? (I do not know what this means, should read more)
- What hands-on experience do I have already?
  - MY470 lectures at my MSc
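
One possible reading of "a Python - SQL application" is a Python script that talks to a database through the standard-library sqlite3 module. The sketch below is only an illustration under that assumption: the `orders` table and its rows are made up, and the numbered comments mark the logical order in which the clauses of a query are evaluated, which differs from the order in which they are written.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("asha", 120.0, "paid"),
        ("asha", 80.0, "paid"),
        ("ben", 40.0, "paid"),
        ("ben", 500.0, "cancelled"),
        ("chen", 300.0, "paid"),
    ],
)

# Logical evaluation order:
# FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT
query = """
    SELECT customer, SUM(amount) AS total   -- 5. pick and compute columns
    FROM orders                             -- 1. choose the table
    WHERE status = 'paid'                   -- 2. filter rows
    GROUP BY customer                       -- 3. form groups
    HAVING SUM(amount) > 50                 -- 4. filter groups
    ORDER BY total DESC                     -- 6. sort the result
    LIMIT 2                                 -- 7. cut the result
"""
top_customers = conn.execute(query).fetchall()
print(top_customers)
conn.close()
```

The same pattern (connect, run a query, fetch rows into Python objects) carries over to other databases through their own driver packages.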

Day 6 - 10th January, 2021

Followed through on my SQL journey from yesterday with some hands-on usage of the RSQLite package
- But, the question now is how can I use this new R + SQL skill to make something useful?
Also, watched The most powerful idea in data science - YouTube by Cassie Kozyrkov

Day 7 - 11th January, 2021

Performed and compared sentiment analysis outputs with the tidytext package in R and the sentimentr package
- https://www.tidytextmining.com/tidytext.html
- trinker/sentimentr: Dictionary based sentiment analysis that considers valence shifters (github.com)
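
The difference between the two approaches can be sketched in a few lines. This is a simplified Python stand-in, not the R packages themselves: a plain word-by-word lexicon lookup versus a lookup that also considers one kind of valence shifter (a negator directly before the word). The lexicon, the negator list and the function names are made up for illustration.

```python
# Toy lexicon and negator list: hypothetical, not taken from any real
# sentiment dictionary.
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}
NEGATORS = {"not", "never", "no"}


def bag_of_words_score(text):
    """Sum word scores, ignoring context (plain lexicon lookup)."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())


def negation_aware_score(text):
    """Flip a word's score when the previous word is a negator."""
    words = text.lower().split()
    total = 0
    for i, word in enumerate(words):
        score = LEXICON.get(word, 0)
        if i > 0 and words[i - 1] in NEGATORS:
            score = -score
        total += score
    return total


sentence = "the service was not good"
print(bag_of_words_score(sentence), negation_aware_score(sentence))
```

The plain lookup scores the sentence as positive because it only sees "good", while the negation-aware version flips the sign.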
Put some thought into using R + SQL for a project - Fixed on creating a dummy ETL system
- Main idea : Create a basic web app that can be used by the animal shelter where I interned
- Data : Create dummy data based on the original data the shelter collects
  - Store in excel sheets as that’s what the shelter does
- Write a web app in R that can load the dataset directly and deliver some visualizations
- Convert the results of the app into a PDF report and save it
  - Can be used to quickly generate monthly reports
  - Also give an option for the user to enter some other manual text or pointers into the report
- Resources

Day 8 - 12th January, 2021

Spent time going over my dissertation proposal and the data I will need to collect to make it happen
- What data do I need?
  - Qualitative?
  - Quantitative?
- How will I collect the data?
  - Use secondary resources?
  - Collect it first hand?
    - With human interaction?
    - Without human interaction?
- How will I store the data?
  - Local system?
  - Cloud?
  - Documentation?
- What ethical considerations must I be aware of?

Day 9 - 13th January, 2021

Read the article - https://hackernoon.com/going-from-not-being-able-to-code-to-deep-learning-hero-2ou34fh by Radek Osmulski
Went in-depth into R Markdown to make a decent final report for my winter assignment - R Markdown Cookbook (bookdown.org)

Day 10 - 14th January, 2021

Initiated work on https://www.widsconference.org/datathon.html
Also, I have secretly fallen in love with #R’s pipe operator ;)

Day 11 - 15th January, 2021

Continued work on the WiDS 2021 Datathon
Worked on the draft of an article summarizing my favourite tips on working on a data analytics project

Day 12 - 16th January, 2021

Published an article on @TDataScience summarizing my favorite tips to create a good Data Analytics project - https://towardsdatascience.com/how-to-make-a-data-analytics-project-that-people-want-to-read-47caea306570
Went through a Tableau crash course tutorial - https://www.youtube.com/watch?v=TPMlZxRRaBQ

Day 13 - 17th January, 2021

Read through Jeremy Howard and team’s Drivetrain approach - Designing great data products – O’Reilly
- Define your objective
- Understand the inputs that you can control - Levers
- Identify the data you need to collect
- Modelling

Day 14 - 18th January, 2021

Made my first WiDS 2021 challenge kernel public - https://kaggle.com/thedatabeast/wids-2021-tutorial…
- Resources used
  - https://machinelearningmastery.com/what-is-data-preparation-in-machine-learning/
  - https://www.kaggle.com/parulpandey/a-guide-to-handling-missing-values-in-python
  - https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
While doing so, brushed up on my rusty skills in data preparation for modelling tasks
Tried the lazypredict package to compare models, but hit memory issues
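
As a reminder of what the two scaling transforms from the resources above compute, here is a minimal pure-Python sketch. It is not the scikit-learn implementation; the function names and the `ages` column are hypothetical. Standardization here uses the population standard deviation, which matches scikit-learn's StandardScaler.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 30, 40, 50, 60]  # hypothetical feature column
print(standardize(ages))
print(min_max_scale(ages))
```

Both functions divide by zero on a constant column, so a real pipeline would need to handle that case.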

Day 15 - 19th January, 2021

Spent 15 mins to chalk out a rough plan for my first project under the #66DaysofData initiative. The idea is to build a simple image classifier and deploy it as a web application.

Hope to start work on it soon and complete it by mid-February.

Day 16 - 20th January, 2021

Dabbled with the modin package - Aims to run faster than pandas while keeping the pandas API
Read the first 2 chapters of Introduction to Statistical Learning

Day 17 - 21st January, 2021

Took a break from any kind of data science learning today.

Instead, I went ahead and dedicated the day to my grad school assignment. Will continue to work on this till I submit it, after which I shall return to #66DaysofData

Day 18 - 25th January, 2021

Resumed things after my little break.

Kicked things off by working through the @fastdotai book’s second chapter. Fired up a kernel with under 10 lines of code - https://www.kaggle.com/thedatabeast/fast-ai-resnet18-2-epochs-99-9-accuracy

Not much, but surely a way to start my fast.ai journey!

Day 19 - 26th January, 2021

Listened to the Decision Skills Q&A by Cassie Kozyrkov and Jenny Brown
- Making a decision takes effort
  - Decision making requires people to be motivated
  - Requires collaboration
  - Important to delegate decision making appropriately
  - Structuring, coordinating and assigning is important to make good decisions
- It is time society begins to treat “decision-making” as a skill just like engineering or singing
- Software and decision-making
  - Software can aid in decision making
  - It helps transcend the limitations of the human brain
  - But, software cannot replace the human in the decision-making process
  - A “one-size fits all” approach will not work in decision making software
- More data is not always better. It has to be more “good” data
- Decision making in its purest sense is about processing information
- The most important question data scientists need to be asking stakeholders
  - “What would it take to change your mind?”
  - Forces the stakeholders to think about what exactly they want
Dived into the creation of pipelines in scikit-learn today - 6.1. Pipelines and composite estimators — scikit-learn 0.24.1 documentation (scikit-learn.org)

Break

I have been forced to take a break from 66 Days of Data due to health issues. I will be resuming the campaign soon after revamping my plan for the same.

Day 20 - 17th March, 2021

Completed SQL Basics from Alex Freberg’s YouTube Channel
Obtained the 10 days of statistics badge from HackerRank

Day 21 - 18th March, 2021

Worked on SQL problems on HackerRank
- Obtained 2 stars on HackerRank for SQL

Day 22 - 19th March, 2021

Continued working through SQL challenges on HackerRank
Watched some SQL Intermediate videos by Alex Freberg - Highly recommend this to anyone who wants bite-sized lectures to get started with SQL

Day 23 - 20th March, 2021

Continued work on SQL
- Focused on joins and unions. Found that the Venn diagram representation of SQL joins makes it much easier to visualize the output of each type of join
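
The Venn diagram picture can be checked against a tiny example. This is a sketch using Python's standard-library sqlite3 module; the `students` and `grades` tables and their rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER, name TEXT);
    CREATE TABLE grades (student_id INTEGER, grade TEXT);
    INSERT INTO students VALUES (1, 'Ana'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO grades VALUES (1, 'A'), (2, 'B'), (4, 'C');
""")

# INNER JOIN: only the overlap of the two circles (ids 1 and 2).
inner_rows = conn.execute("""
    SELECT s.name, g.grade
    FROM students AS s
    INNER JOIN grades AS g ON g.student_id = s.id
    ORDER BY s.id
""").fetchall()

# LEFT JOIN: the whole left circle; unmatched rows get NULL (None in Python).
left_rows = conn.execute("""
    SELECT s.name, g.grade
    FROM students AS s
    LEFT JOIN grades AS g ON g.student_id = s.id
    ORDER BY s.id
""").fetchall()

# UNION stacks two result sets vertically and drops duplicate rows.
union_rows = conn.execute("""
    SELECT id FROM students
    UNION
    SELECT student_id FROM grades
    ORDER BY 1
""").fetchall()

print(inner_rows, left_rows, union_rows)
conn.close()
```

One caveat with the Venn diagram view: when a key appears more than once on either side, a join returns one row per matching pair, so the output can have more rows than either circle suggests.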

Day 24 - 21st March, 2021

Learnt Google Data Studio
- Obtained a certificate from Google Analytics Academy for Introduction to Data Studio
Worked on a data engineering mini-project based on the course by Karolina Sowinska
- Link to the repository: ry05/spotify_data_pipeline: A basic data pipeline to learn some data engineering basics (github.com)
- Link to Ms Sowinska’s Data Engineering Playlist

Day 25 - 22nd March, 2021 to Day 33 - 30th March, 2021

Dedicated the last full week to the study of An Introduction to Statistical Learning - First Edition
- Completed all chapters except chapter 7
- Made notes that are available at ry05/ISL_Notes: Draft notes taken while reading An Introduction to Statistical Learning (github.com)