Pardon the mess

We’ve made significant changes and improvements to the product since these docs were written. You’ll find that we’ve made the workflow in this playbook faster and easier.

We haven’t updated this particular playbook just yet, so it might not be very useful. Go to docs home to see the latest on how Sequin works.

Building on GitHub with Retool

We’re Sequin. We let you build apps and workflows on top of APIs like GitHub using just SQL and Postgres. We sync changes in the API to your database and vice-versa. We’re also excited about all the new ML/AI tooling that’s available for developers. One tool, embeddings, lets you search, cluster, and categorize your data in a whole new way. With your API data in Postgres, it’s easy to leverage embeddings to create powerful apps. Sign up to get GitHub in your Postgres database and follow along in the guide below!

In this post, I’ll show you how to build a tool in Retool. In the process, we’ll explore some neat tools including:

  • Retool’s Workflow product
  • Retool’s App builder
  • OpenAI’s API
  • Embeddings

You’ll build an embeddings search tool for your GitHub PRs and commits. With embeddings, you’re no longer restricted to simple string comparison – you can build a search function that does advanced semantic comparison between your search query and your data.

Curious about embeddings and what they’re used for? See our Salesforce blog post on the topic.

In addition to the search tool demonstrated here, you can also build on this foundation to do cluster analysis. For example, you can determine how many commits and/or PRs in the last month were related to bugfixes vs new features. I’ll touch on that at the end of this post.

This post assumes you already have GitHub setup with Sequin to sync GitHub objects to your Postgres database. You should also have an OpenAI account setup.

Overview

Your application will revolve around a Postgres database. Sequin will sync GitHub commits and PRs to your database. You’ll generate OpenAI embeddings for each of the GitHub objects you want to perform searches against.

To perform a search, the user will type in a query, such as “embed Elixir struct into Postgres jsonb.” Then, you can take that query, turn it into an embedding, and compare its relatedness to the embeddings of all the commit messages and PR bodies.

You’ll also need to setup a workflow to ensure new or updated PRs and commits that Sequin syncs to your database have their embeddings generated as well.

The architecture will look like this:

Architecture of the application, including a Retool Workflow and a Retool App.

On the left, Sequin will continuously sync GitHub to your database. It will also fire off webhooks that you can use to trigger a Retool Workflow, which will update the embeddings for new and updated records.

The right hand side of the diagram shows the user-facing Retool App that will perform search queries against the database.

To generate embeddings, you’ll want to rely on a third-party vendor. You can use APIs like OpenAI’s to generate them.

Prepare your database

To prepare your database, first add the pg_vector extension 1:

create extension vector;

Create a separate schema, github_embedding, for your embedding data 2. In your queries, you’ll join your embedding tables to your GitHub tables.

Here’s an example of creating an embedding table for GitHub commits:

create table github_embedding.commit (
  id text references github.commit(id) on delete cascade,
  embedding vector(1536) not null
)

In this post, I’ll show you how to use OpenAI’s text-embedding-ada-002 model. That model generates embeddings with 1536 dimensions, hence the 1536 parameter above.

Generate embeddings on insert or update

You’ll first setup your app to generate embeddings for GitHub records as they’re inserted or updated. Then, I’ll show you how to backfill embeddings for your existing records.

You have two options for finding out about new or updated GitHub records in your database.

You can use Postgres’ listen/notify protocol. It’s fast to get started with and works great. But, notify events are ephemeral, so delivery is at-most-once. That means there’s a risk of missing notifications, and therefore of there being holes in your data.

Along with a sync to Postgres, Sequin provisions an event stream for you. Sequin will publish events to a serverless Kafka stream associated with your sync. Sequin will publish events like “GitHub Pull Request deleted” or “GtiHub Commit upserted.”

You can configure connectors to this Kafka stream. One of the connectors is HTTP, which POSTs events to an endpoint you choose. We can use the HTTP connector to have events sent to a webhook endpoint over on Retool. That webhook endpoint can trigger a workflow, which we can use to update the embeddings table. Notably, unlike listen/notify, the Kafka stream is durable and the HTTP connector will retry on failure, meaning we can get at-least-once delivery.

First, over on Retool, create a new Workflow. In the startTrigger block, click “Edit triggers” and then toggle on “Webhook”:

Start trigger block in a Retool Workflow, option for webhook.

Selecting the “Webhook” toggle on the left rail will expand a drawer that reveals the endpoint URL for this Workflow. Copy that endpoint:

The webhook configuration panel of the startTrigger block.

Before leaving Retool, be sure to save your changes and click the “Deploy” button to ensure the Webook endpoint is running.

Then, over on Sequin, configure the “HTTP Connector” for your GitHub sync. For the destination “HTTP URL,” enter your Workflow’s webhook endpoint:

The HTTP/sink connector configuration in Sequin's console.

Click “Save.” At this point, Sequin will send webhooks to your Retool Workflow every time it captures an insert or update to a GitHub Pull Request or Commit. Make some changes over in GitHub, and you’ll see those events kick off runs in your Retool Workflow:

List of recent workflow runs in Retool Workflow console. There are evidently a few webhooks that have been received.

Now that Sequin is sending your workflow webhooks, you can compose a step in your workflow that takes the event and generates an embedding for the associated record. Last, you can create a step that upserts the embedding into your database.

Add a new block to your Retool Workflow. This block will be a Resource Query to OpenAI. Configure your block like this:

Retool Workflow Block for OpenAI, showing how to configure a call to the /embeddings endpoint.

For Operation, select POST /embeddings. For the request body, set the input to the strings you want to generate the embeddings off of. In this example, we’re generating embeddings for a Pull Request, and using its title and body.

For the model, use OpenAI’s text-embedding-ada-002.

Press the “Play” icon and verify that your request works. You should see a JSON result from OpenAI that contains a list of a bunch of vectors.

Now that you have the embeddings, you’ll upsert them into your embedding table (see the create table statement above). I recommend you use Retool’s GUI mode to compose the upsert query. Not only is it easier to use than SQL mode for upserts. But getting vectors/arrays to work in Retool’s SQL mode is a little tricky.

Here’s how to configure your upsert:

Block that configures a SQL upsert into the embeddings table.

Be sure you select the right upsert “Action type,” as shown above. You’ll use startTrigger.data.id for the id column and query1.data.data[0].embedding for the embedding column (assuming you left your OpenAI query to the default query1).

At this point, you’re ready to test things end-to-end. You can start by using a test event. In “Run history,” select an event then select startTrigger. Click the “Data” tab. Scroll down to the bottom and you’ll see “Use as example JSON.” Click that. Now, you should see “Test JSON parameters” populated for your startTrigger.

With test JSON parameters populated, you can step through each block: first, run the query to OpenAI and verify you get back a list of embeddings. 3 Next, run your upsert. Assuming that succeeds, check your database and verify a row was indeed inserted.

Now, deploy your Retool workflow. At this point, when a GitHub record is inserted or updated, your workflow will get triggered and populate the associated embedding record. Try it! Head over to GitHub and open or edit a PR. Just a few seconds afterwards, you should see the run get triggered over in Retool.

With your listener in place, the next step is to backfill all the records with null values for embedding in the database.

Backfill the embedding column for existing records

You have two primary options for backfilling the embedding column:

Create a batch job

You can write a one-off batch job that paginates through your table and kicks off API calls to fetch the embeddings for each record.

You can paginate through each table like this 4:

select id, body, title from github.pull_request order by id asc limit 1000 offset 0;

For each record you grab from the database, you can run through the same pipeline: fetch the embeddings, then upsert into the database.

This would mean creating a different workflow inside or outside of Retool, though. Instead, you can have Sequin perform the backfill for you.

Use a Sequin job

Alternatively, you can have Sequin do the record pagination and collection part for you. This will let you use your existing event handling code to backfill your table.

You can kick-off a backfill of your events stream via the Sequin console. Sequin will paginate your Postgres tables and fill your stream with events that have the same shape as upsert events:

{ "id": "1245638131", "collection": "commit", { "data": [] } }

To kick-off the backfill, Sequin will ask you which topic to publish these events to. We recommend using a distinct topic for these events, such as github.backfills.pull_request_embeddings. So, before kicking off the backfill, be sure to update your Webhook sink in the Sequin console to subscribe to the topic. That will ensure that events published to this topic get sent to your Retool workflow.

After the backfill has completed, you’ll have embeddings for all your desired GitHub objects!

Create a Postgres query for finding matches

With your embeddings setup in Postgres, you’re ready to create a mechanism for querying them.

Supabase has a great post on embeddings in Postgres. I’ve adapted their similarity query below. You can use the cosine distance operator (<=>) provided by pg_vector to determine similarity. Here’s a query that grabs a list of pull_requests over a match_threshold, ordered by most similar to least similar:

select
  pull_request.id,
  pull_request.title,
  pull_request.body,
  1 - (embedding_pull_request.embedding <=> {{searchEmbedding.value}}) as similarity
from github_sequin.pull_request as pull_request
join github_embedding_sequin.pull_request as embedding_pull_request on pull_request.id = embedding_pull_request.id
-- match threshold set to 0.75, you can change it
where 1 - (embedding_pull_request.embedding <=> {{searchEmbedding.value}}) > 0.75
order by similarity desc
-- match count set to 5, you can change it
limit 5;

You might consider wrapping this into a Postgres function and storing it in your database

create or replace function match_pull_requests(query_embedding vector(1536), match_threshold float, match_count int)
  returns table(
    id bigint,
    title text,
    body text,
    similarity float)
  language sql
  stable
  as $$
  select
    pull_request.id,
    pull_request.title,
    pull_request.body,
    1 -(embedding_pull_request.embedding <=> query_embedding) as similarity
  from
    github_sequin.pull_request as pull_request
    join github_embedding_sequin.pull_request as embedding_pull_request on pull_request.id = embedding_pull_request.id
    -- match threshold
  where
    1 -(embedding_pull_request.embedding <=> query_embedding) > match_count
  order by
    similarity desc
    -- match count
  limit match_threshold;
$$;

Build the tool

With your data model and search function squared away, you can build your tool. It should have a table for results and a search bar.

Below is a simple example of this tool. Here’s a demonstration of a search for Pull Requests that mention “serialize and deserialize structs into jsonb ecto”:

Example Retool App for searching your GitHub PRs and commits. Displays a table of results, with a search bar up top.

On the left, we see the list of the top 5 PRs that matched, sorted by similarity descending. On the right is a preview of the PR that you selected.

Note that this is not a literal string match. The search refers to the “serialize and deserialize errors,” but the PR contains serializes/deserializes. The PR also doesn’t mention jsonb, just JSON.

Because of embeddings, we found the exact PR we were looking for, and with only a vague idea of what we were looking for!

To build a tool like this, drop in a basic table, a search bar, and a search button. Then, you’ll compose your queries and variables. Here will be the flow:

  • Users can enter a search query into the app and press “Search.”
  • Clicking “Search” will fire off getQueryEmbedding (below), using OpenAI to convert the search input into an embedding.
  • When getQueryEmbedding returns, it will set the value of the variable queryEmbedding (below).
  • When the variable of queryEmbedding changes, the Postgres query searchPRs (below) runs.
  • Finally, searchPRs does a similarity match between the embedding in the search query and the embeddings for all your GitHub Pull Requests stored in your database. It returns the most similar, rendering them in the table for your user to see.

Starting from the database and working up:

searchPRs

First, create a new variable, queryEmbedding, for storing the embedding of the search query.

Then, create a new Postgres query called searchPRs. The body of the query will look like this:

select
  pull_request.id,
  pull_request.created_at,
  pull_request.title,
  coalesce(nullif(pull_request.body, ''), pull_request.title) as body,
  1 - (embedding_pull_request.embedding <=> {{queryEmbedding.value}}::vector) as similarity
from github_sequin.pull_request as pull_request
join github_embedding_sequin.pull_request as embedding_pull_request on pull_request.id = embedding_pull_request.id
-- match threshold set to 0.75, you can change it
where 1 - (embedding_pull_request.embedding <=> {{queryEmbedding.value}}::vector) > 0.75
order by similarity desc
-- match count set to 5, you can change it
limit 5;

You can tweak both the match threshold and the match count to your liking.

Note that the queryEmbedding is being cast to vector (::vector) on the Postgres side. That’s because Retool doesn’t fully support Postgres vectors just yet. So, as you’ll see, we’re going to pass the database vectors as strings and have it cast the vectors into the right type.

Set this query to Run query automatically when inputs change.

getQueryEmbedding

Now, create a new query using the OpenAI adapter. This will take the search input and turn it into an embedding, using the same functionality you used earlier to generate embeddings for your GitHub Pull Requests:

JavaScript function that fetches the query embedding from OpenAI's API using Retool's connector.

In Request Body, set the input to the value of the search field in your app.

In Transform results, you need to convert the array of floats into a string literal. As noted earlier, Retool’s Postgres adapter doesn’t support vectors. So, convert it to a string, and then the searchPRs query will turn the string into a vector in the database (::vector).

Finally, add an Event Handler. Set the value of the queryEmbedding variable to the output of this query.

With these queries in place, your app is wired up and ready to go!

Conclusion

Once you get a taste of embeddings, it’s hard to go back to search that’s restricted to only literal matches. Between commits, pull requests, and issues, there’s a lot to sift through when you’re looking for something specific. Embeddings help you find precisely what you want without being precise.

But search is only one way you can use embeddings to build tooling on top of your GitHub data. You can also use embeddings to perform analysis on code that’s been committed, like the ratio of bugs to new features. Or to surface the most critical-seeming commits and PRs that have been pushed in the last few days.

To get started building tools with Retool and embeddings on your GitHub data, give Sequin a spin in a free trial!

Footnotes

  1. pg_vector is included in most of the latest distributions of Postgres.

    If you’re on AWS RDS, be sure you upgrade to Postgres 15.2+ to get access to the vector extension.

  2. You can mix and match fields from different tables to generate embeddings. To start, you can keep it simple and generate embeddings that correspond to a single GitHub object. For most objects, you’ll probably choose to create an embedding for just one or two fields. For example, you don’t need to create an embedding for the whole Pull Request object, just the title and body fields. You can concatenate the two fields together into a newline-separated string, and generate the embedding on that.

    In the future, you can blend more fields or objects together to let you build on your data in novel ways.

  3. Embeddings are difficult to “validate” just by looking at them – you can accidentally send OpenAI a meaningless blank string and they’ll still return a list of vectors. So, be sure to validate that all the variables in your query are properly populated. With test data in place, you can click on variables and Retool will display a popover of the value of the variable.

  4. Normally a pagination strategy like this wouldn’t be safe unless IDs were auto-incrementing. But this will work fine in all situations, because we don’t care if we miss records that are inserted mid-pagination — those are being handled by our event handler above!

Was this page helpful?