
Introducing SourceSailor

Posted on Apr 30, 2024, 10 minute read

Diving into an unfamiliar codebase can be as daunting as decoding a treasure map. SourceSailor, a command-line tool, is here to guide you through these murky waters. Whether you’re a seasoned developer or a novice, SourceSailor provides clear, structured insights into any codebase, ensuring that you understand every line of code and every function’s purpose.

While this post dedicates a section to what SourceSailor does, I encourage you to download the tool from NPM and run it against SourceSailor's own codebase to see the magic in action. Running it will show you what the tool does. But as I said earlier, answering why is more important than knowing what is happening in your code, and the why is rarely automated, so no tool can answer it for you. This post therefore tells the story of how and why SourceSailor came to be. By the way, if you want the feature list of SourceSailor, you can check it here.

The Genesis: First Commit and Early Vision

When I started my previous job, I was handed a big monolith to which I had to add companion functionality for a microservice I had developed and deployed separately. There was little introduction to the codebase, and there was no documentation apart from release notes. I felt first-hand the frustration of navigating an established codebase without documentation, and I realized it would be a problem for other developers too: when I later handed over another crucial service to a junior member of the team, I watched them struggle with the undocumented parts.

On the other hand, we knew LLMs can translate natural language into code and vice versa. There are plenty of tools that do the former (generating code from natural language), but very few that do the latter, and probably none that give the output I was expecting. So I decided to build a tool that could help me understand the codebase I was working with, or the codebases I had starred on GitHub a couple of years ago but never found the time to understand.

Thus started SourceSailor (one of the names suggested to me by either GPT-4 Turbo or Gemini when I was first dabbling with the idea. I don't remember which one suggested it, but I liked it and decided to go with it).

The Process of Understanding a Codebase

To grasp a new codebase effectively, we follow a structured approach:

  1. Overall Structure: Identify whether it’s a monorepo or a single project. This dictates how we explore further.
  2. Dependencies: Analyze relevant files (e.g., package.json, go.mod) to understand the project’s external dependencies and their versions.
  3. Entry Point: Locate and examine the code’s entry point, which sets the stage for understanding the execution flow.
  4. Monorepo Specifics (if applicable): Determine the roles of different directories within the monorepo before analyzing each subcomponent independently using steps 2 and 3.

By following these steps, we gain a comprehensive understanding of the codebase’s organization and dependencies, paving the way for efficient exploration and analysis.
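
To make this flow concrete, here is a minimal sketch in TypeScript of how such a pipeline could be orchestrated. It is illustrative only: the helpers (isMonorepo, findDependencyFiles) and the heuristics they use are my assumptions, not SourceSailor's actual internals.

```ts
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical dependency manifests to look for (step 2).
const DEPENDENCY_FILES = ["package.json", "go.mod", "requirements.txt", "Cargo.toml"];

function findDependencyFiles(dir: string): string[] {
  return fs.readdirSync(dir).filter((f) => DEPENDENCY_FILES.includes(f));
}

// Step 1: crude heuristic: treat a repo as a monorepo when more than one
// subdirectory carries its own dependency manifest.
function isMonorepo(root: string): boolean {
  const subdirs = fs
    .readdirSync(root, { withFileTypes: true })
    .filter((e) => e.isDirectory() && e.name !== "node_modules" && !e.name.startsWith("."));
  return subdirs.filter((d) => findDependencyFiles(path.join(root, d.name)).length > 0).length > 1;
}

// Steps 2 and 3 for a single project: dependencies, then an entry-point guess.
function analyseProject(dir: string): void {
  console.log(`[${dir}] dependency files:`, findDependencyFiles(dir));
  const entryCandidates = ["main.go", "index.js", "index.ts", "main.py"];
  const entry = fs.readdirSync(dir).find((f) => entryCandidates.includes(f));
  console.log(`[${dir}] entry point:`, entry ?? "not found");
}

// Step 4: for monorepos, run the same analysis per subdirectory.
function analyse(root: string): void {
  if (isMonorepo(root)) {
    for (const e of fs.readdirSync(root, { withFileTypes: true })) {
      if (e.isDirectory() && !e.name.startsWith(".") && e.name !== "node_modules") {
        analyseProject(path.join(root, e.name));
      }
    }
  } else {
    analyseProject(root);
  }
}

analyse(process.cwd());
```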

Getting Started

I decided to develop it as a CLI because a CLI integrates seamlessly into a developer's workflow. It feels as handy as having cheat codes for your projects, letting you run analysis within your existing setup.

Go is my go-to language for crafting CLIs: it's efficient and superb for distribution. That said, while I started developing this tool as a Go CLI, I hit a snag with tree-sitter support, which is a bit undercooked in Go. In fact, crafting a Mac binary was a no-go (pun purely intended) because of these tree-sitter limitations.

So I continued with NodeJS and yargs after hitting those tree-sitter limitations. The first version I rolled out had a single command with a basic prompt which read the project directory, probing whether it was a monorepo while also pinpointing dependencies and entry files.
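
For readers unfamiliar with yargs, a single-command CLI looks roughly like this. The command name and flags are illustrative, not SourceSailor's actual interface:

```ts
#!/usr/bin/env node
// A minimal single-command yargs CLI (illustrative, not SourceSailor's real surface).
import yargs from "yargs";
import { hideBin } from "yargs/helpers";

yargs(hideBin(process.argv))
  .command(
    "analyse <path>",
    "Analyse a codebase and report its structure",
    (y) => y.positional("path", { type: "string", demandOption: true }),
    (argv) => {
      console.log(`Analysing ${argv.path}...`);
      // ...hand off to the analysis pipeline here...
    }
  )
  .demandCommand(1)
  .strict()
  .parse();
```

It was a decent launchpad, but there was a snag: the prompt itself was a bit murky, and the data fed into it wasn't quite up to snuff for making precise inferences. This experience hammered home a crucial insight…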

With state-of-the-art frontier models like GPT-4, Claude Opus, or Gemini 1.5, the quality of the data you provide for inference is key to nailing accurate inferences.

Structuring the Deck: Establishing the Foundation

In the early stages of development, I focused on refining SourceSailor’s directory structure. My objective was to replicate the functionality of the traditional tree command in a JSON format suitable for our tool. After experimenting with various JSON schemas and adjusting through trial and error, I achieved an implementation that not only looks clean but also integrates flawlessly with the rest of the tool’s architecture, ensuring precise mapping of every directory and file.
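
As a rough illustration, a tree-to-JSON walk takes only a few lines of Node. The schema below is a minimal sketch of the idea, not SourceSailor's actual format:

```ts
import * as fs from "node:fs";
import * as path from "node:path";

// A minimal JSON analogue of the classic `tree` command (illustrative schema).
type TreeNode = { name: string; type: "file" | "directory"; children?: TreeNode[] };

function buildTree(dir: string): TreeNode {
  const children = fs.readdirSync(dir, { withFileTypes: true }).map((entry) =>
    entry.isDirectory()
      ? buildTree(path.join(dir, entry.name))
      : ({ name: entry.name, type: "file" } as TreeNode)
  );
  return { name: path.basename(dir), type: "directory", children };
}

console.log(JSON.stringify(buildTree(process.cwd()), null, 2));
```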

During development, a critical decision arose: whether to utilize function calling or rely solely on the output from prompts. I quickly recognized the importance of function calling for dynamically incorporating AI insights into the application’s decision-making processes. Function calling allows us to use responses from models like GPT-4 to trigger specific actions within the tool, such as loops or conditional statements, ensuring that every step is based on the most relevant and current data.

Additionally, while OpenAI’s models allow for JSON outputs without predefined schemas, which adds convenience, this can sometimes lead to inconsistent results. To mitigate this, I implemented a specific JSON schema for prompts to ensure consistency and accuracy in the data received. This approach not only helps in current operations but also prepares us for seamless integration with future models that might require explicit schemas.
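
Here is a hedged sketch of what that looks like with the OpenAI Node SDK. The function name report_repo_structure and its schema are my own illustration, not SourceSailor's actual definitions:

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Force the model to answer through an explicit JSON schema via function calling.
// (Assumes an ESM context for top-level await.)
const response = await openai.chat.completions.create({
  model: "gpt-4-turbo",
  messages: [
    { role: "system", content: "You are a senior engineer analysing a repository." },
    { role: "user", content: "Here is the directory tree as JSON: ..." },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "report_repo_structure", // hypothetical function name
        description: "Report whether the repository is a monorepo and where dependencies live",
        parameters: {
          type: "object",
          properties: {
            isMonorepo: { type: "boolean" },
            dependencyFiles: { type: "array", items: { type: "string" } },
            entryPoint: { type: "string" },
          },
          required: ["isMonorepo", "dependencyFiles"],
        },
      },
    },
  ],
  tool_choice: { type: "function", function: { name: "report_repo_structure" } },
});

// The structured arguments can now drive the tool's own control flow.
const call = response.choices[0].message.tool_calls?.[0];
if (call) {
  const args = JSON.parse(call.function.arguments);
  if (args.isMonorepo) {
    // branch into per-directory analysis
  }
}
```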

By embedding function calling from the beginning, SourceSailor efficiently discerns whether it’s handling a single project or a monorepo, thereby optimizing the analysis of dependencies and project structures and facilitating the integration of new models that support function calling.

Initially, I explored using tree-sitter to enhance code structure analysis, but I realized that most codebases are compact enough that tree-sitter can be overkill for a model that can handle a large context window. The experiment gave me the following insights:

  1. Most codebases are surprisingly compact, and with tools like GPT-4 Turbo, which can handle up to 128k tokens, parsing them is straightforward. Looking ahead to integrating models like Claude or Gemini 1.5, we’ll have even more capacity if needed.

  2. My use of .gitignore effectively trims down unnecessary files, reducing the overall size of what needs to be analyzed. There was a hiccup where .gitignore in subdirectories wasn’t recognized, but I’ve since resolved that issue.

  3. More data usually translates to clearer insights, so I opted to analyze the full scope of the files.

With these considerations in mind, I decided to rely on passing entire files for analysis rather than just depending on tree-sitter. Tree-sitter only becomes necessary if, after all the exclusions, a codebase still exceeds the 128k token threshold.
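
As a sketch of how such filtering and the token check can work: the ignore package on NPM implements .gitignore semantics, and a rough rule of thumb is about four characters per token. Whether SourceSailor uses this exact package is my assumption:

```ts
import * as fs from "node:fs";
import * as path from "node:path";
import ignore from "ignore"; // NPM package implementing .gitignore matching

const root = process.cwd();
const ig = fs.existsSync(".gitignore")
  ? ignore().add(fs.readFileSync(".gitignore", "utf8"))
  : ignore();

// Collect files not excluded by the root .gitignore. (Nested .gitignore files
// would need the same treatment per directory, which was the hiccup mentioned above.)
function collectFiles(dir: string, out: string[] = []): string[] {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    const rel = path.relative(root, full);
    if (rel === ".git" || ig.ignores(rel)) continue;
    if (entry.isDirectory()) collectFiles(full, out);
    else out.push(full);
  }
  return out;
}

const files = collectFiles(root);
const totalChars = files.reduce((sum, f) => sum + fs.statSync(f).size, 0);
// ~4 characters per token is a common rough heuristic, not an exact count.
console.log(`${files.length} files, roughly ${Math.ceil(totalChars / 4)} tokens`);
// If this exceeds the model's window (e.g. 128k tokens), fall back to ASTs.
```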

Choosing the right words in the prompt and adding a personality is a huge game changer for both the effectiveness and the enjoyment of the results. I knew this, and this was one more time I experienced it. Most of the early prompts were asking the model to guess things; as soon as I asked the model to get things, the accuracy increased.

Prompt engineering benefits significantly from personalized responses. Infusing a senior engineer’s personality into prompts enhanced both accuracy and enjoyment, demonstrating the impact of tailored interactions.
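
As an illustration only (my paraphrase, not SourceSailor's actual prompt), a persona-driven system prompt might read:

```ts
// Illustrative persona prompt; the wording is mine, not SourceSailor's.
const systemPrompt = `You are a senior software engineer reviewing a repository.
Given its directory tree and dependency files, explain the project's purpose,
architecture, and entry points. Do not guess: when information is missing,
say so explicitly instead of inventing details.`;
```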

At this point, SourceSailor was explaining some smaller codebases quite well. This is where I call it the version which works.1

Charting New Territories: Adding Commands and Refactoring

With the foundational aspects of SourceSailor refined, I focused on enhancing its robustness through several functional upgrades. A significant enhancement was enabling the seamless switching between different AI models without needing a complete setup restart, boosting the tool’s agility and responsiveness to user needs.

Initially, I prioritized perfecting SourceSailor’s functionality for single-project repositories before addressing the complexities of monorepos. This methodical approach ensured a solid foundation, allowing for straightforward scaling to manage multiple directories within monorepos.

Another crucial feature was implementing streaming capabilities, essential in the AI-driven era of chat-based LLM interfaces. SourceSailor’s CLI effectively streams results, ensuring dynamic and interactive user experiences.
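
With the OpenAI Node SDK, for example, streaming to the terminal is a short loop. This is a generic sketch rather than SourceSailor's exact code:

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Stream the model's answer to the terminal chunk by chunk.
// (Assumes an ESM context for top-level await.)
const stream = await openai.chat.completions.create({
  model: "gpt-4-turbo",
  messages: [{ role: "user", content: "Summarise this repository: ..." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
process.stdout.write("\n");
```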

Lastly, the decision to shift away from Go due to tree-sitter limitations led to significant changes. SourceSailor now leverages tree-sitter to extract the Abstract Syntax Tree (AST) and pass that instead of entire code files, enhancing efficiency by focusing on structural elements rather than the full source.
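
With the Node bindings, extracting an AST is only a few lines. The JavaScript grammar below is one of many available grammars; this is a generic sketch:

```ts
// npm i tree-sitter tree-sitter-javascript
import Parser from "tree-sitter";
import JavaScript from "tree-sitter-javascript";

const parser = new Parser();
parser.setLanguage(JavaScript);

const source = "function add(a, b) { return a + b; }";
const tree = parser.parse(source);

// The S-expression form of the AST is far more compact than most source
// files while still conveying the code's structure to an LLM.
console.log(tree.rootNode.toString());
```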

These enhancements equip SourceSailor with all necessary functionalities, poised for its next phase of deployment.

Preparing for Voyage: Final Preparations Before Releases

Integrating a Textual User Interface (TUI) with progress bars transformed SourceSailor’s user experience from static to dynamic, making the analysis process visually engaging and informative. Users can now see real-time progress, eliminating any guesswork about the tool’s activity.
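
One common way to get such progress feedback in a Node CLI is the cli-progress package (whether SourceSailor uses this exact library is my assumption):

```ts
// npm i cli-progress
import cliProgress from "cli-progress";

const files = ["src/index.ts", "src/utils.ts", "src/prompts.ts"]; // illustrative

const bar = new cliProgress.SingleBar({}, cliProgress.Presets.shades_classic);
bar.start(files.length, 0);

for (const file of files) {
  // ...analyse `file` here...
  bar.increment();
}
bar.stop();
```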

To enhance flexibility, I revised the initial setup to include configurable settings, allowing users to designate storage locations for analysis files. This modification prevents clutter in the codebase and maintains the scalability of SourceSailor’s core functions.
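
A common pattern for such settings is a small JSON file under the user's home directory. This sketch is illustrative; the location ~/.sourcesailor/config.json and the fields are my assumptions, not necessarily SourceSailor's real layout:

```ts
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical config location; the real tool may store settings elsewhere.
const configDir = path.join(os.homedir(), ".sourcesailor");
const configPath = path.join(configDir, "config.json");

type Config = { analysisDir: string; defaultModel: string };

function loadConfig(): Config {
  if (fs.existsSync(configPath)) {
    return JSON.parse(fs.readFileSync(configPath, "utf8"));
  }
  return { analysisDir: path.join(configDir, "analysis"), defaultModel: "gpt-4-turbo" };
}

function saveConfig(config: Config): void {
  fs.mkdirSync(configDir, { recursive: true });
  fs.writeFileSync(configPath, JSON.stringify(config, null, 2));
}

saveConfig(loadConfig()); // persist defaults on first run
```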

Addressing file management, I expanded the capabilities of the .gitignore feature to exclude unnecessary files more effectively, refining the focus on essential data and boosting both efficiency and cost-effectiveness. Additionally, users can now export directory structures with or without file content, applying the same filters used during analysis. This feature is particularly useful for developers who require the JSON structure of directories for further processing.

Another significant addition was the report generation feature, which automatically updates project READMEs with insights from each analysis. This not only saves time but ensures that documentation remains as current as the code itself, greatly benefiting project teams by maintaining accurate and up-to-date project descriptions.

With these enhancements, SourceSailor is fully equipped for public release. I've even utilized its capabilities to draft the Installation and Usage section of its own README. And why stop there? I even penned the entire README of JobJigsaw using the analysis and report generation capabilities of SourceSailor. Now, it's time to launch this ship onto the vast ocean of NPM.

Maiden Voyage: The Release Day

Releasing SourceSailor on NPM was an enlightening journey, and it led to some valuable insights that I’ll delve into in a future “TIL” (Today I Learned) section. Here’s a snapshot of the process:

  1. Automating the Deployment: I set up a GitHub Action to automate deployment to NPM. This action kicks in whenever I update the version in package.json, ensuring that new changes are promptly available to users (a sketch of such a workflow follows this list).

  2. Syncing Information: I made sure that the codebase, version numbers, and documentation stayed consistent between the GitHub README and the NPM page. This not only maintains professionalism but ensures users always have the most current information.
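
For reference, a publish-on-change workflow can look roughly like this. It is a generic sketch under my own assumptions (the trigger, the NPM_TOKEN secret name, and the file paths are illustrative), not SourceSailor's actual workflow:

```yaml
# .github/workflows/publish.yml (illustrative sketch)
name: Publish to NPM
on:
  push:
    branches: [main]
    paths: [package.json]   # run only when package.json changes

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          registry-url: https://registry.npmjs.org
      - run: npm ci
      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
```

Note that this sketch publishes on any package.json change; gating strictly on a version bump would need an extra comparison step.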

With SourceSailor now successfully launched on NPM, the focus shifts to refining and expanding its capabilities to better serve the developer community.

The Horizon: Future Plans for SourceSailor

This is just the first version of SourceSailor, and there’s a vast ocean of possibilities to explore. Here are a few key areas I plan to focus on in the coming months:

  • Introduce a user profile prompt. I want a way for users to tell SourceSailor which languages they know and their proficiency in each; this will help make the analysis and reports more understandable for that user.
  • Add Claude and Gemini 1.5 to the mix. At this point I am not planning to add any local models, as I haven't found them to work well with large context windows. If necessary, they can be added via OpenRouter.
  • A highly speculative item is to use the generated reports as a source for RAG and question answering. I won't commit to it now, as I don't have much practical experience with RAG, but I am not discarding it entirely…

In Star Wars Episode IV, Luke turns off his targeting computer and trusts the Force to fire the shot that destroys the Death Star. When you are trusting the force of LLMs, you don't have to turn off your targeting computer. Instead, let SourceSailor help you destroy the Death Star of ignorance, fear, and confusion. I hope it helps you chart new codebases with ease.

With that

May the force be with you…


  1. Based on Kris Nóva's book Hacking Capitalism, which describes a phased approach to building software called "Make it work well for you" (summarization from Perplexity). At this point I am at the "Make it work" phase… ↩︎

Tags

LLMs, OpenAI, OpenAI Apis, Source Understanding, CLI