How to build a Data Analysis Project, You Ask?
Well, instead of just talking about it, let's build a project with Python, Pandas, Plotly, and Persistence.
We will walk through building a data analysis project from scratch, using my Census Analysis Dashboard as a real-world example. By the end of this, you'll hopefully have a roadmap for creating a data analysis project or dashboard.
You can check out the completed dashboard at https://uscensus.streamlit.app/. This interactive web app allows users to explore U.S. demographics at the ZIP code or census tract level. All the code can be found on GitHub here.
Let’s dive in!
1. Define The Problem To Solve
Pretty simple, right? But this step is super important. You need to know what problem to solve before exploring any data. In fact, knowing what data you need would be almost impossible without understanding your problem statement.
At work, the problem to solve is usually given to you by a stakeholder. One thing to do is make sure you understand the issue correctly—essentially, the “why” of the problem or question.
There are many practical use cases for building a Census Analysis Dashboard (marketing campaigns, real estate, healthcare, etc.), but honestly, my “why” was simply letting anyone (mainly me) pull up a demographic analysis for a specific ZIP code.
Write down your project objectives. They'll guide your decisions throughout the development process.
2. Getting that SWEET Data
This is where your journey starts. Well… it depends. If you are at work, you typically have access to all the data you need, and now that you understand the problem statement, getting it is pretty easy.
But finding data for your projects can be challenging, sometimes even causing you to give up on the project itself.
My journey for demographic data started by going through the typical dataset sites like Kaggle and Amazon, which led me to the World Data Bank. All this data was great, but none of it was broken down by ZIP code, so it was no good for me.
Finally, after about 3 hours of all this, I went to Reddit and found a comment mentioning the mother lode of US demographic information: the U.S. Census API.
Census.gov is the most confusing website, and I firmly believe they did this purposefully. This is the moment I thought of giving up on this project. I wanted to use the API but was considering just downloading the dataset instead. But after much frustration, I found the simplest way to make an API call. Here it is:
import asyncio
import aiohttp

ZIPCODE = '90014'
YEAR = '2022'
COLUMN_CODE = 'B01001_001E'              # example ACS variable (total population)
CENSUS_DATA_API = 'YOUR_CENSUS_API_KEY'  # your Census Data API key
BASE_URL = f"https://api.census.gov/data/{YEAR}/acs/acs5"
params = {
    'get': COLUMN_CODE,
    'key': CENSUS_DATA_API,
    'for': f'zip code tabulation area:{ZIPCODE}',
}

async def fetch_census_data():
    # Query the ACS 5-year endpoint for the chosen ZIP Code Tabulation Area
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector()) as session:
        async with session.get(BASE_URL, params=params) as response:
            if response.status == 200:
                return await response.json()

data = asyncio.run(fetch_census_data())
3. Diving into the Trenches of Data
Once you get the data, it is time to explore and prepare it. Usually, there is a data dictionary if you are in an imaginary company; otherwise, you annoy the seniors and SMEs until you understand the data, exploring the dataset with a bunch of print/SELECT statements along the way.
Luckily for us, the US Census provides a data dictionary somewhere here. The only problem is that it covers around 28,000 variables. There was no SME here, which meant just using CMD + F to explore the data dictionary.
I ran into an issue here: there were so many similar columns, and so many others I felt I should include. This was because my “why” was not clear enough. I told myself I wanted to see demographic information but did not specify what exactly I wanted to see and why.
Since this is a simple project just for myself, I decided to go with some basic demographic information. I created a JSON file to map the cryptic Census variable codes I needed to human-readable names. Once I had it, getting the data I needed from the API and putting it into pandas data frames was pretty easy. This, by far, took the longest time since I had to gather all the data and clean it correctly for the analysis.
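To make that concrete, here's a minimal sketch of what the mapping and loading step can look like. The file name variable_map.json and the codes in the comment are just illustrative, and data is the JSON response from the API call above, which the Census API returns as a list of lists (a header row followed by value rows).

import json
import pandas as pd

# Hypothetical variable_map.json mapping cryptic ACS codes to readable names:
# { "B01001_001E": "Total Population", "B19013_001E": "Median Household Income" }
with open("variable_map.json") as f:
    variable_map = json.load(f)

# First row of the API response is the header; the rest are values
df = pd.DataFrame(data[1:], columns=data[0])
df = df.rename(columns=variable_map)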
4. Analysis Time
This step is pretty straightforward since it involves applying your skills and learning to solve the problem with the data. To start, I wanted to see population distribution by age and sex, employment statistics, and a few other breakdowns. Since I had already done everything in Python so far, I continued building with Pandas.
I am not the best at pandas, so I had to google and read through the documentation a lot to pivot and concat data frames and do some simple analysis.
Side note: what the hell is pandas.melt()? When did we start massaging data?
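In case you're also wondering, melt() just unpivots wide columns into long, tidy rows. A tiny sketch with made-up numbers (the column names are hypothetical, not the dashboard's):

import pandas as pd

# Wide format: one row per ZIP code, one column per age bracket (made-up values)
wide = pd.DataFrame({
    "zip_code": ["90014"],
    "age_0_17": [1200],
    "age_18_64": [5400],
    "age_65_plus": [900],
})

# melt() massages those columns into one row per (zip_code, age_group) pair
long = wide.melt(id_vars="zip_code", var_name="age_group", value_name="population")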
I hope you understand this is a crucial step driven by our “why” (problem statement) and all the previous decisions. It gets easier the more you practice and tackle complex analyses.
5. Making it look Sexy
Congratulations! You have finished most of the work. I usually take a break, watch a movie, or get a couple of Warzone games in, unless I'm about to miss a deadline.
You have all the analysis, but showing plain Excel without charts these days will not go over well with the stakeholders, especially if you want to get that dough and some praise. So let’s spice it up.
I’m sure there is a better way to handle this, but since it was easier, I created some extra data frames to visualize certain analyses better. I also chose to make it a Streamlit app because it is convenient and very accessible to anyone.
One important note: it doesn’t matter what tool you use; how you display the information matters most. Don’t use some cool new visualization chart you found if the person who has to look at it will have a hard time understanding it. Also, I used Plotly here since it’s cooler than mAtPlOtLiB. Fight me.
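To show how little Streamlit asks of you, here's a minimal sketch of a page with one Plotly chart. It assumes the tidy long DataFrame from the melt sketch above is built in the same script; the real dashboard obviously has more inputs and charts.

import plotly.express as px
import streamlit as st

st.title("Census Analysis Dashboard")
zip_code = st.text_input("ZIP code", value="90014")

# `long` is the population-by-age-group frame from the melt sketch above
fig = px.bar(long, x="age_group", y="population",
             title=f"Population by age group, ZIP {zip_code}")
st.plotly_chart(fig, use_container_width=True)

Save it as, say, app.py, run streamlit run app.py, and you have a working web page.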
Your Turn: Dive In or Contribute!
Now that you've seen the process of building a data analysis project from start to finish, it's your turn to get your hands dirty! Start thinking about your own data analysis project right now and get on with it.
If not, here are some ways you can practice and contribute:
Clone the Census Analysis Dashboard repository and run it locally.
Explore the code and see how the concepts we discussed are implemented.
Think of a feature you'd like to add or an improvement you could make. Maybe additional visualizations? Or analysis of trends over time?
Create a fork of the repository, make your changes, and submit a pull request. This is a great way to get real-world experience with collaborative coding!
Remember, the key is to start building, keep learning, and don't be afraid to ask questions or seek help.
That is it from me! I hope this exploration was helpful in some way!
If you found value in this article, please share it with someone who might also benefit from it. Your support helps spread knowledge and inspires more content like this. Let's keep the conversation going—share your thoughts and experiences below!