How to Organize Your Projects on GitHub


“Seo Computer” by Serpstat / CC0 1.0

While this post is mainly for my beginner friends, I’m sure there are some tips that anyone can use!

Hello again, everyone!

So you’ve been learning some coding, and maybe even machine learning, and now you are ready to put your knowledge into action! Projects are an excellent way to develop and solidify new skills, as well as to show potential employers your capabilities! With that said, how you structure and showcase your work is important. If a recruiter or interviewer finds your GitHub projects sloppy or hard to use, it won’t matter how awesome your new anime recommender system with an integrated web app is (and yes, it’s pretty awesome!). So, today, I want to delve into the significance of organizing data science projects on GitHub, a platform that’s become an essential tool for many of us in this field.

Why is Organization So Crucial?

  1. Navigational Ease: A clear project structure ensures that anyone who stumbles upon your repository can easily navigate and understand your work. It’s like giving them a roadmap to your thought process.
  2. Reproducibility Matters: One of the cornerstones of data science is ensuring others can replicate your results. A well-organized repository shows others where your information came from, including your datasets, so they can recreate your project. That said, step back before committing anything and consider the security risks of the data and code you are about to share.
  3. Showcasing Professionalism: Your GitHub repository is a reflection of your dedication and commitment. A neat and logical layout speaks volumes about your attention to detail.

Key Components for a Structured Layout:

  • README.md: This is your project’s introduction. It should provide an overview of the project, its objectives, the datasets used, and a guide on how to set everything up. Someone once told me that this is the place where non-technical stakeholders can look at what you are doing and get a clear grasp of your project.
  • .gitignore: This is an essential file that specifies which files or directories should be excluded from version control. This keeps sensitive data, cache files, and other non-essential files from getting uploaded to GitHub. Tip: Set this up right away, before you start making your commits! Adding a file to .gitignore later does not remove it from commits you have already made.
  • Data Folder: Every project has its unique datasets. Organizing them neatly ensures clarity and maintains data integrity.
  • Notebooks Folder: If you code in a notebook (I like Jupyter notebooks and Colab), place your notebooks in their own folder. They should combine code, narrative, and visualizations. I personally split mine into EDA and Modeling.
  • Scripts Folder: For more extensive projects, having separate scripts ensures modularity and easier debugging. This can include, but is not limited to, modeling scripts, app scripts, automation scripts, etc.
  • Results Folder: This is where your final outputs live. This might contain visualizations, tables, and modeling results that you want to display.
  • Acknowledgments: Always give credit where it’s due. Whether it’s data sources or external tools, acknowledging them builds trust and aids in reproducibility.
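
If you would rather not create these pieces by hand every time, the layout above can be sketched as a small Python script. Treat this as a minimal sketch and not the one right way to do it: the folder names (data, notebooks, scripts, results), the README section headings, the ignore patterns (including the data/raw/ subfolder), and the function name scaffold_project are all my own assumptions, so swap in whatever fits your project.

```python
from pathlib import Path

# Folder names mirror the list above; the exact names are an assumption.
FOLDERS = ["data", "notebooks", "scripts", "results"]

# Illustrative ignore patterns only (hypothetical); adjust to your project.
GITIGNORE_LINES = [
    "__pycache__/",
    ".ipynb_checkpoints/",
    ".env",
    "data/raw/",
]

# Starter README sections; the headings are a suggestion, not a standard.
README_TEMPLATE = """# {name}

## Overview

## Objectives

## Data

## Setup

## Acknowledgments
"""


def scaffold_project(root, name):
    """Create the folders plus a starter README.md and .gitignore under root."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)

    for folder in FOLDERS:
        folder_path = root / folder
        folder_path.mkdir(exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (folder_path / ".gitkeep").touch()

    # Only write starter files if they are missing, so nothing is overwritten.
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(README_TEMPLATE.format(name=name), encoding="utf-8")

    gitignore = root / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text("\n".join(GITIGNORE_LINES) + "\n", encoding="utf-8")

    return root
```

Calling scaffold_project("anime-recommender", "Anime Recommender") would create the folder and starter files in your current directory. Because the .gitignore is written before you ever run git add, it is already in place for your first commit.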

Data projects are not only ways to develop skills, but also ways to show off how well you can organize and present your work. It pays off to take the extra time and make your repos look professional.

Happy githubbing! (Is that a word?)
