GitHub Availability Report: May 2024
In May, we experienced one incident that resulted in degraded performance across GitHub services.
Unstructured data holds valuable information about codebases, organizational best practices, and customer feedback. Here are some ways you can leverage it with RAG, or retrieval-augmented generation.
Whether they’re building a new product or improving a process or feature, developers and IT leaders need data and insights to make informed decisions.
When it comes to software development, this data exists in two ways: unstructured and structured. While structured data follows a specific and predefined format, unstructured data—like email, an audio or visual file, code comment, or commit message—doesn’t. This makes unstructured data hard to organize and interpret, which means teams can miss out on potentially valuable insights.
To make the most of their unstructured data, development teams are turning to retrieval-augmented generation, or RAG, a method for customizing large language models (LLMs). They can use RAG to keep LLMs up to date with organizational knowledge and the latest information available on the web. They can also use RAG and LLMs to surface and extract insights from unstructured data.
GitHub data scientists, Pam Moriarty and Jessica Guo, explain unstructured data’s unique value in software development, and how developers and organizations can use RAG to create greater efficiency and value in the development process.
When it comes to software development, unstructured data includes source code and the context surrounding it, as these sources of information don’t follow a predefined format.
Here are some examples of unstructured data on GitHub:
The same features that make unstructured data valuable also make it hard to analyze.
Unstructured data lacks inherent organization, as it often consists of free-form text, images, or multimedia content.
“Without clear boundaries or predefined formats, extracting meaningful information from unstructured data becomes very challenging,” Guo says.
But LLMs can help to identify complex patterns in unstructured data—especially text. Though not all unstructured data is text, a lot of text is unstructured. And LLMs can help you to analyze it.
“When dealing with ambiguous, semi-structured or unstructured data, LLMs dramatically excel at identifying patterns, sentiments, entities, and topics within text data and uncover valuable insights that might otherwise remain hidden,” Guo explains.
Need a refresher on LLMs? Check out our AI explainers, guides, and best practices > |
Here are a few reasons why developers and IT leaders might consider using RAG-powered LLMs to leverage unstructured data:
Accelerate and deepen understanding of an existing codebase—including its conventions, functions, common issues, and bugs. Understanding and familiarizing yourself with code written by another developer is a persisting challenge for several reasons, including but not limited to: code complexity, use of different coding styles, a lack of documentation, use of legacy code or deprecated libraries and APIs, and the buildup of technical debt from quick fixes and workarounds.
RAG can help to mediate these pain points by enabling developers to ask and receive answers in natural language about a specific codebase. It can also guide developers to relevant documentation or existing solutions.
Accelerated and deepened understanding of a codebase enables junior developers to contribute their first pull request with less onboarding time and senior developers to mitigate live site incidents, even when they’re unfamiliar with the service that’s failing. It also means that legacy code suffering from “code rot” and natural aging can be more quickly modernized and easily maintained.
Unstructured data doesn’t just help to improve development processes. It can also improve product decisions by surfacing user pain points.
Moriarty says, “Structured data might show a user’s decision to upgrade or renew a subscription, or how frequently they use a product or not. While those decisions represent the user’s attitude and feelings toward the product, it’s not a complete representation. Unstructured data allows for more nuanced and qualitative feedback, making for a more complete picture.”
A lot of information and feedback is shared during informal discussions, whether those discussions happen on a call, over email, on social platforms, or in an instant message. From these discussions, decision makers and builders can find helpful feedback to improve a service or product, and understand general public and user sentiment.
Contrary to unstructured data, structured data—like relational databases, Protobuf files, and configuration files—follows a specific and predefined format.
We’re not saying unstructured data is more valuable than structured. But the processes for analyzing structured data are more straightforward: you can use SQL functions to modify the data and traditional statistical methods to understand the relationship between different variables.
That’s not to say AI isn’t used for structured data analysis. “There’s a reason that machine learning, given its predictive power, is and continues to be widespread across industries that use data,” according to Moriarty.
However, “Structured data is often numeric, and numbers are simply easier to analyze for patterns than words are,” Moriarty says. Not to mention that methods for analyzing structured data have been around longer** **than those for analyzing unstructured data: “A longer history with more focus just means there are more established approaches, and more people are familiar with it,” she explains.
That’s why the demand to enhance structured data might seem less urgent, according to Guo. “The potential for transformative impact is significantly greater when applied to unstructured data,” she says.
With RAG, an LLM can use data sources beyond its training data to generate an output.
RAG is a prompting method that uses retrieval—a process for searching for and accessing information—to add more context to a prompt that generates an LLM response.
This method is designed to improve the quality and relevance of an LLM’s outputs. Additional data sources include a vector database, traditional database, or search engine. So, developers who use an enterprise AI tool equipped with RAG can receive AI outputs customized to their organization’s best practices and knowledge, and proprietary data.
We break down these data sources in our RAG explainer, but here’s a quick summary:
And when you’re engaging with GitHub Copilot Chat on GitHub.com or in the IDE, your query or code is transformed into an embedding. Our retrieval service then fetches relevant embeddings from the vector database for the repository you’ve indexed. These embeddings are turned back into text and code when they’re added to the prompt as additional context for the LLM. This entire process leverages unstructured data, even though the retrieval system uses embeddings internally.
Learn about GitHub Copilot Enterprise features >
But wait, why is Markdown considered unstructured data? Though you can use Markdown to format a file, the file itself can contain essentially any kind of data. Think about it this way: how would you put the contents of a Markdown file in a table?
Retrieval also taps into internal search engines. So, if a developer wants to ask a question about a specific repository, they can index the repository and then send their question to GitHub Copilot Chat on GitHub.com. Retrieval uses our internal search engine to find relevant code or text from the indexed files, which are then used by RAG to prompt the LLM for a contextually relevant response.
Stay smart: LLMs can do things they weren’t trained to do, so it’s important to always evaluate and verify their outputs.
As developers improve their productivity and write more code with AI tools like GitHub Copilot, there’ll be even more unstructured data. Not just in the code itself, but also the information used to build, contextualize, maintain, and improve that code.
That means even more data containing rich insights that organizations can surface and leverage, or let sink and disappear.
Developers and IT leaders can use RAG as a tool to help improve their productivity, produce high-quality and consistent code at greater speed, preserve and share information, and increase their understanding of existing codebases, which can impact reduced onboarding time.
With a RAG-powered AI tool, developers and IT leaders can quickly discover, analyze, and evaluate a wealth of unstructured data—simply by asking a question.