This page contains my daily bit of data science learning for #66DaysofData, starting 5th January, 2021.
Day 1 - 5th January, 2021
- Read all articles in the #DataDecember Initiative
- Completed Module 1 of BCG Open-Access Data Science & Advanced Analytics Virtual Experience Program
- Designed a data science problem based on a given hypothesis
Day 2 - 6th January, 2021
- Binged through Ken Jee’s Sports Analytics videos and content
- The 4 Types of Sports Analytics Projects - YouTube
- How YOU Can Land a Sports Analytics Job - YouTube
- Official Website: Playing Numbers - Sports Analytics Content for those Interested in Playing the Numbers
- Conferences and Blogs
- Ken Jee’s Tips to get a job in Sports Analytics
- Read at least 2-3 articles per day
- Learn skills and tools
- Engage with community
- Work for university teams
- Do projects
- Produce content
- Reach out
- Scraped a website on bribes paid in India using the rvest package in R (Site: http://ipaidabribe.com) and collected 1000 instances of such bribes in the last year
- 1000 is too few. However, with my assignments, I currently don’t have the time to run the scraper for longer and collect more data
- Potential project : Analyze bribes in India
- Which department takes the most bribes?
- Which state collects the most?
- What is the sentiment of people writing the posts?
- Bonus: Is the initiative by this website being used correctly?
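
The first two project questions above come down to a group-and-sum over the scraped reports. Here is a minimal sketch in plain Python, assuming each report has been parsed into a record with `department`, `state` and `amount` fields. Those field names and every value below are made-up placeholders, not the actual scraped data, and the original scrape was done in R with rvest rather than Python.

```python
from collections import defaultdict

# Hypothetical records: field names and values are invented for illustration.
reports = [
    {"department": "Police", "state": "Karnataka", "amount": 500},
    {"department": "Transport", "state": "Karnataka", "amount": 1200},
    {"department": "Police", "state": "Maharashtra", "amount": 300},
    {"department": "Municipal Services", "state": "Maharashtra", "amount": 2000},
    {"department": "Police", "state": "Karnataka", "amount": 700},
]


def total_by(records, key):
    """Sum the reported amount for each distinct value of `key`."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += record["amount"]
    return dict(totals)


def top(totals):
    """Return the (name, total) pair with the largest total."""
    return max(totals.items(), key=lambda item: item[1])


by_department = total_by(reports, "department")
by_state = total_by(reports, "state")

print("Department with the largest total:", top(by_department))
print("State with the largest total:", top(by_state))
```

Counting reports per group instead of summing amounts may answer a different question (how often versus how much), so it is worth deciding which one "most" means before analyzing the real data.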
Day 3 - 7th January, 2021
- Worked with the rtweet package to collect tweets from Twitter
- Used this to work on my winter assessment
- Resource used: https://mkearney.github.io/nicar_tworkshop
Day 4 - 8th January, 2021
- Had a perfunctory read over the winning solutions to the Hateful Memes Challenge hosted on DrivenData
- Listed out potential projects that I can work on during the next few weeks as a part of #66DaysofData
- Why?
- It would be useless to do #66DaysofData if I don’t put it to practical use
- Will help me improve my portfolio
- What types of projects?
- The plan is to work on projects that involve at least one of the following concepts
- Intensive data cleaning
- Storytelling with data
- Predictive analytics - Tabular data
- Natural Language Processing
- Computer Vision
- SQL
- End-to-end deployment or Building an ML system
Day 5 - 9th January, 2021
- Brushed up SQL Basics
- SQL Cheatsheet
- SQL Order of Operations
- What project can I do to ensure I understand SQL better?
- A web app that uses SQL? (But, I don’t like web development much)
- A Python - SQL application? (I do not know what this means, should read more)
- What hands-on experience do I have already?
- MY470 lectures at my MSc
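
One possible reading of "a Python - SQL application" is simply a Python script that sends SQL to a database and works with the rows it gets back. The sketch below uses the standard-library `sqlite3` module with an in-memory database. The `orders` table and its rows are made up for illustration. The numbered comments in the query mark the logical order in which SQL clauses are evaluated, which differs from the order in which they are written.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")

# Made-up rows, inserted with parameter placeholders rather than string formatting.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("asha", 40.0, "paid"),
        ("asha", 70.0, "paid"),
        ("ben", 20.0, "paid"),
        ("ben", 90.0, "cancelled"),
        ("chen", 150.0, "paid"),
    ],
)

query = """
    SELECT customer, SUM(amount) AS total   -- 5. pick the output columns
    FROM orders                             -- 1. choose the source table
    WHERE status = 'paid'                   -- 2. filter individual rows
    GROUP BY customer                       -- 3. form groups
    HAVING SUM(amount) > 50                 -- 4. filter whole groups
    ORDER BY total DESC                     -- 6. sort the result
"""
rows = conn.execute(query).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```

Because `WHERE` runs before grouping, the cancelled order never reaches the sum, and because `HAVING` runs after grouping, it can filter on the aggregated total.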
Day 6 - 10th January, 2021
- Followed through on my SQL journey from yesterday with some hands-on usage of the RSQLite package
- But, the question now is how can I use this new R + SQL skill to make something useful?
- Also, watched The most powerful idea in data science - YouTube by Cassie Kozyrkov
Day 7 - 11th January, 2021
- Performed and compared sentiment analysis outputs with the tidytext package in R and the sentimentr package
- Put some thought into using R + SQL for a project and settled on creating a dummy ETL system
- Main idea : Create a basic web app that can be used by the animal shelter where I interned
- Data : Create dummy data based on the original data the shelter collects
- Store in Excel sheets as that’s what the shelter does
- Write a web app in R that can load the dataset directly and deliver some visualizations
- Convert the results of the app into a PDF report and save it
- Can be used to quickly generate monthly reports
- Also give an option for the user to enter some other manual text or pointers into the report
- Resources
Day 8 - 12th January, 2021
- Spent time going over my dissertation proposal and the data I will need to collect to make it happen
- What data do I need?
- Qualitative?
- Quantitative?
- How will I collect the data?
- Use secondary resources?
- Collect it first hand?
- With human interaction?
- Without human interaction?
- How will I store the data?
- Local system?
- Cloud?
- Documentation?
- What ethical considerations must I be aware of?
Day 9 - 13th January, 2021
- Read the article - https://hackernoon.com/going-from-not-being-able-to-code-to-deep-learning-hero-2ou34fh by Radek Osmulski
- Went in-depth into R Markdown to make a decent final report for my winter assignment - R Markdown Cookbook (bookdown.org)
Day 10 - 14th January, 2021
- Initiated work on https://www.widsconference.org/datathon.html
- Also, I have secretly fallen in love with #R’s pipe operator ;)
Day 11 - 15th January, 2021
- Continued work on the WiDS 2021 Datathon
- Worked on the draft of an article summarizing my favourite tips on working on a data analytics project
Day 12 - 16th January, 2021
- Published an article on @TDataScience summarizing my favorite tips to create a good Data Analytics project - https://towardsdatascience.com/how-to-make-a-data-analytics-project-that-people-want-to-read-47caea306570
- Went through a Tableau crash course tutorial - https://www.youtube.com/watch?v=TPMlZxRRaBQ
Day 13 - 17th January, 2021
- Read through Jeremy Howard’s and team’s Drivetrain approach - Designing great data products – O’Reilly
- Define your objective
- Understand the inputs that you can control - Levers
- Identify the data you need to collect
- Modelling
Day 14 - 18th January, 2021
- Made my first WiDS 2021 challenge kernel public - https://kaggle.com/thedatabeast/wids-2021-tutorial…
- Resources used
- https://machinelearningmastery.com/what-is-data-preparation-in-machine-learning/
- https://www.kaggle.com/parulpandey/a-guide-to-handling-missing-values-in-python
- https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
- While doing so, brushed up on my rusty skills in data preparation for modelling tasks
- Tried the lazypredict package to compare models, but hit memory issues
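
The resources above cover missing values and the StandardScaler and MinMaxScaler transforms. As a rough illustration of what those steps compute, here is a plain-Python sketch using only the standard library. It is not the scikit-learn implementation, the helper names are hypothetical, and the `ages` values are made up.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)  # would be zero for a constant column
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)  # lo == hi for a constant column
    return [(v - lo) / (hi - lo) for v in values]


# Made-up column with one missing value.
ages = [20.0, None, 40.0, 60.0]

filled = impute_mean(ages)
print(filled)
print(standardize(filled))
print(min_max_scale(filled))
```

Both scalers divide by a quantity that is zero for a constant column, so a real preparation step needs a guard for that case.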
Day 15 - 19th January, 2021
Spent 15 mins to chalk out a rough plan for my first project under the #66DaysofData initiative. The idea is to build a simple image classifier and deploy it as a web application.
Hope to start work on it soon and complete it by mid-February.
Day 16 - 20th January, 2021
- Dabbled with the modin package - It aims to run faster than pandas while exposing the same API, so it can be used like pandas
- Read the first 2 chapters of Introduction to Statistical Learning
Day 17 - 21st January, 2021
Took a break from any kind of data science learning today.
Instead, I dedicated the day to my grad school assignment. I will continue to work on it until I submit it, after which I shall return to #66DaysofData.
Day 18 - 25th January, 2021
Resumed things after my little break.
Kicked things off by working through the second chapter of the @fastdotai book. Fired up a kernel with under 10 lines of code - https://www.kaggle.com/thedatabeast/fast-ai-resnet18-2-epochs-99-9-accuracy
Not much, but surely a way to start my fast.ai journey!
Day 19 - 26th January, 2021
- Listened to the Decision Skills Q&A by Cassie Kozyrkov and Jenny Brown
- Making a decision takes effort
- Decision making requires people to be motivated
- Requires collaboration
- Important to delegate decision making appropriately
- Structuring, coordinating and assigning is important to make good decisions
- It is time society begins to treat “decision-making” as a skill just like engineering or singing
- Software and decision-making
- Software can aid in decision making
- It helps transcend the limitations of the human brain
- But, software cannot replace the human in the decision-making process
- A “one-size fits all” approach will not work in decision making software
- More data is not always better. It has to be more “good” data
- Decision making in its purest sense is about processing information
- The most important question data scientists need to be asking stakeholders
- “What would it take to change your mind?”
- Forces the stakeholders to think what exactly they want
- Dived into the creation of pipelines in scikit-learn today - 6.1. Pipelines and composite estimators — scikit-learn 0.24.1 documentation (scikit-learn.org)
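
A minimal sketch of the idea behind scikit-learn pipelines: chain the preprocessing steps and the final estimator into one object, so that `fit` and `predict` apply every step in order. The dataset, the step names and the choice of steps below are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset: two numeric features, with missing values in the second.
X = np.array(
    [
        [1.0, 200.0],
        [2.0, np.nan],
        [3.0, 240.0],
        [8.0, 900.0],
        [9.0, np.nan],
        [10.0, 950.0],
    ]
)
y = np.array([0, 0, 0, 1, 1, 1])

# Each step is a (name, estimator) pair; all but the last must be transformers.
model = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="median")),  # fill missing values
        ("scale", StandardScaler()),                   # zero mean, unit variance
        ("classify", LogisticRegression()),            # final estimator
    ]
)

# fit() learns the imputation values and scaling statistics from X, then
# trains the classifier on the transformed data.
model.fit(X, y)

# predict() reuses the fitted imputer and scaler before calling the classifier.
predictions = model.predict(X)
print(predictions)
```

Keeping the preprocessing inside the pipeline means that, under cross-validation, the imputer and scaler are fitted on the training folds only, which helps avoid leaking information from held-out data.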
Break
I have been forced to take a break from 66 Days of Data due to health issues. I will be resuming the campaign soon after revamping my plan for the same.
Day 20 - 17th March, 2021
- Completed SQL Basics from Alex Freberg’s YouTube Channel
- Obtained the 10 days of statistics badge from Hackerrank
Day 21 - 18th March, 2021
- Worked on SQL problems on Hackerrank
- Obtained 2 stars on Hackerrank for SQL
Day 22 - 19th March, 2021
- Continued working through SQL challenges on Hackerrank
- Watched some SQL Intermediate videos by Alex Freberg - Highly recommend this to anyone who wants bite-sized lectures to get started with SQL
Day 23 - 20th March, 2021
- Continued work on SQL
- Focused on joins and unions. Found out that the Venn diagram representation of SQL joins makes it so much easier to visualize the output of each type of join
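
To make the difference between join types and unions concrete, here is a small sketch using the standard-library `sqlite3` module. The two tables and their rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES ('asha', 1), ('ben', 2), ('chen', NULL);
    INSERT INTO departments VALUES (1, 'Analytics'), (2, 'Engineering'), (3, 'Design');
    """
)

# INNER JOIN keeps only rows that have a match in both tables.
inner = conn.execute(
    """
    SELECT e.name, d.dept_name
    FROM employees AS e
    INNER JOIN departments AS d ON e.dept_id = d.dept_id
    ORDER BY e.name
    """
).fetchall()

# LEFT JOIN keeps every row from the left table and fills gaps with NULL.
left = conn.execute(
    """
    SELECT e.name, d.dept_name
    FROM employees AS e
    LEFT JOIN departments AS d ON e.dept_id = d.dept_id
    ORDER BY e.name
    """
).fetchall()

# UNION stacks two result sets and drops duplicates; UNION ALL keeps them.
union = conn.execute(
    """
    SELECT dept_id FROM employees WHERE dept_id IS NOT NULL
    UNION
    SELECT dept_id FROM departments
    ORDER BY dept_id
    """
).fetchall()
union_all = conn.execute(
    """
    SELECT dept_id FROM employees WHERE dept_id IS NOT NULL
    UNION ALL
    SELECT dept_id FROM departments
    """
).fetchall()
conn.close()

print(inner)
print(left)
print(union)
print(len(union_all))
```

Joins combine columns from two tables side by side, while unions stack rows from two queries on top of each other, which is why the Venn diagram picture fits joins better than unions.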
Day 24 - 21st March, 2021
- Learnt Google Data Studio
- Obtained a certificate from Google Analytics Academy for Introduction to Data Studio
- Worked on a data engineering mini-project based on the course by Karolina Sowinska
- Link to the repository: ry05/spotify_data_pipeline: A basic data pipeline to learn some data engineering basics (github.com)
- Link to Ms Sowinska’s Data Engineering Playlist
Day 25 - 22nd March, 2021 to Day 33 - 30th March, 2021
- Dedicated the last full week to the study of An Introduction to Statistical Learning - First Edition
- Completed all chapters except chapter 7
- Made notes that are available at ry05/ISL_Notes: Draft notes taken while reading An Introduction to Statistical Learning (github.com)