Cautiously Configuring Copilot

I've written this article for the code.kiwi.com blog.

Security folks do not like Copilot for several reasons; one of the main concerns is the leakage of secrets. Basically, we are searching for a configuration that would prevent this – adhering to the principle of least privilege: third parties do not need our secrets.

Conclusion (or TL;DR)

To keep things short for a busy professional who’ll take my words at face value – I think this can be done.

Look at the config options for the editor extensions – there is a setting that disables Copilot for specific file types. I’d recommend disallowing all of them and allowing only the languages you are actually using.

If you follow the good practice of separating configuration from implementation, you should be covered. Yes, if you hardcode secrets into the source code, they will be sent to Microsoft, GitHub, and OpenAI servers.

Also, just to be sure, there is a telemetry server at copilot-telemetry.githubusercontent.com. While I haven’t observed any sensitive data traveling this way, it is something to keep in mind if you want to “opt-out” using firewall means.
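
For illustration, the crudest way to blackhole it on a single Linux box is an /etc/hosts entry (this is just a sketch of the idea – I haven’t tested how gracefully the agent handles the endpoint being unreachable):

# /etc/hosts – blackhole the Copilot telemetry host
127.0.0.1 copilot-telemetry.githubusercontent.com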

The plugins can be configured to use an HTTP proxy – this would enable you to intercept everything that gets sent and try to detect and drop the requests containing secrets (using, e.g., Yelp’s detect-secrets project).
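
As a sketch of what that could look like: a mitmproxy addon that drops completion requests matching naive secret patterns. The patterns and the 403 response are my own placeholders, not anything the plugins require; a real deployment would plug in a proper scanner such as detect-secrets:

# drop_secrets.py – run with: mitmproxy -s drop_secrets.py
import re

from mitmproxy import http

# Naive example patterns: GitHub personal access tokens and AWS key IDs.
SECRET_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]{20,}"),
    re.compile(r"AKIA[0-9A-Z]{16}"),
]

def request(flow: http.HTTPFlow) -> None:
    # Only inspect traffic headed for Copilot's backend hosts.
    if "githubusercontent.com" not in flow.request.pretty_host:
        return
    body = flow.request.get_text(strict=False) or ""
    if any(p.search(body) for p in SECRET_PATTERNS):
        # Refuse to forward the prompt upstream.
        flow.response = http.Response.make(403, b"blocked: possible secret in prompt")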

Now let’s take a look at how Copilot actually works and how exactly it is sending our data.

Copilot Plugins

I suspect that most developers will interact with Copilot using plugins for their IDEs. There’s still the option of editing in GitHub’s web editors with Copilot enabled, but I won’t cover that here.

There are four plugins (as of 6th March 2023):

  • Neovim/Vim
  • Visual Studio Code
  • Visual Studio
  • JetBrains plugin

The core of each plugin is the Copilot agent, written in JavaScript and sprinkled with WebAssembly. This agent is responsible for communicating with the AI model, which is provided “as a service” over the web. Just to be absolutely clear: the model is not present in the plugin.

The agent code is minified at the very least (perhaps even obfuscated). It is ready to be run using Node.js.

There were some attempts at reverse-engineering this agent; I’ll mention thakkarparth007’s work (link to the repository).

I realized very quickly that I was not interested in this JavaScript reversing route, so I opted for “dynamic analysis” instead (just looking at what requests it produces).

Lab Setup

I’ll be doing my experiments with the Neovim plugin, as it’s the simplest.

  • First, prepare a new Ubuntu LTS virtual machine
  • Get Neovim 0.6+ (the plugin requires it)
  • Install Node.js v16+
  • Install the extension from GitHub: git clone https://github.com/github/copilot.vim.git ~/.config/nvim/pack/github/start/copilot.vim
  • Configure Copilot to use the proxy and disable TLS checks – open ~/.config/nvim/init.vim and enter:

let g:copilot_proxy = '192.168.100.1:8080'
let g:copilot_proxy_strict_ssl = v:false

I am fortunate that Microsoft’s users often find themselves behind a “corporate proxy” – something that enables us to see what is actually sent. I use the Burp Suite proxy, but any intercepting proxy will do.

So, I am able to sit as a MITM between the Copilot agent and its servers. Firewall configuration ensures that no other requests go out (I don’t want to explain iptables here), so there is no other side channel the agent could use.

API Calls

Start-up

A pair of calls (each with a different UUID) is made to: http://dc.services.visualstudio.com:443/api/profiles/<uuid>/appId

Each call returns a single UUID. I have no idea what the purpose of these calls is; the domain belongs to Azure Application Insights, so my guess is that they are telemetry-related.

Enrolling

First, Copilot does a GitHub device login (root github.com) using the standard OAuth device flow. Keeping this part short (a Python sketch of the flow follows the list):

  • POST /login/device/code with the client id and scope in the body; the response contains a device code and a user code to be entered at https://github.com/login/device. The code expires in 900 s (15 min), and the polling interval is set to 5 s.
  • The agent then periodically polls POST /login/oauth/access_token with the client id, the device code, and the grant type urn:ietf:params:oauth:grant-type:device_code
  • Once the user authenticates and authorizes the app, this same endpoint returns the access token and the token type (usually bearer)
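
Since this is the documented GitHub device flow, we can reproduce it ourselves. A minimal sketch in Python – note that the client id below is a placeholder (I’m not reproducing Copilot’s real one), and the scope is an assumption:

import time
import requests

CLIENT_ID = "<copilot-client-id>"  # placeholder – use your own OAuth app's id

# Step 1: request a device code and a user code.
resp = requests.post(
    "https://github.com/login/device/code",
    data={"client_id": CLIENT_ID, "scope": "read:user"},
    headers={"Accept": "application/json"},
).json()
print(f"Enter {resp['user_code']} at {resp['verification_uri']}")

# Step 2: poll until the user authorizes the app (or the code expires).
while True:
    time.sleep(resp["interval"])  # 5 s, per the observed flow
    token = requests.post(
        "https://github.com/login/oauth/access_token",
        data={
            "client_id": CLIENT_ID,
            "device_code": resp["device_code"],
            "grant_type": "urn:ietf:params:oauth:grant-type:device_code",
        },
        headers={"Accept": "application/json"},
    ).json()
    if "access_token" in token:
        print(token["token_type"], token["access_token"])
        break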

So we are logged in to github.com. The next step is to verify the Copilot license validity (root api.github.com):

  • GET /user to get the user
  • GET /copilot_internal/v2/token to get the user token (no body or params) – stored at ~/.config/github-copilot/hosts.json

You can see the integration at https://github.com/settings/connections/applications/<client_id>

For some reason, a notification is sent to POST /copilot_internal/notification

Telemetry

POST to copilot-telemetry.githubusercontent.com/telemetry

Seems pretty standard. I haven’t found anything suspicious in there, of course. They log user ids, session ids, your operating system, editor, versions, and whatnot.

Here are the events I got:

agent/auth.new_login
agent/auth.new_token
agent/completion.alreadyInDocument
agent/completion.finishReason
agent/ghostText.accepted
agent/ghostText.canceled
agent/ghostText.cyclingPerformance
agent/ghostText.empty
agent/ghostText.issued
agent/ghostText.performance
agent/ghostText.produced
agent/ghostText.rejected
agent/ghostText.shown
agent/ghostText.shownFromCache
agent/ghostText.stillInCode
agent/networking.cancelRequest
agent/request.response
agent/request.sent

In some versions of Copilot, this might be richer, as hinted by thakkarparth007’s research, but in my current config, I didn’t see anything suspicious in the telemetry.

Completing

Starting with a hello-world in Python to get the hang of Copilot flows. It seems that this data is sent to: POST copilot-proxy.githubusercontent.com/v1/engines/copilot-codex/completions

Let’s say I forgot a shebang, so I inserted a new line and #! at the start of the file:

{
  "prompt": "# Path: main.py\n#!",
  "suffix": "if __name__ == \"__main__\":\n    print(\"Hello World!\")\n\n",
  "max_tokens": 500,
  "temperature": 0,
  "top_p": 1,
  "n": 1,
  "stop": [
    "\n"
  ],
  "stream": true,
  "extra": {
    "language": "python",
    "next_indent": 0,
    "trim_by_indentation": true
  }
}

Here is the suggestion in the response body:

data: {"id":"cmpl-6otEFIevL5BkdAusEEkkVjA6AIdTj","model":"cushman-ml","created":1677586719,"choices":[{"text":"/","index":0,"finish_reason":null,"logprobs":null}]}
data: {"id":"cmpl-6otEFIevL5BkdAusEEkkVjA6AIdTj","model":"cushman-ml","created":1677586719,"choices":[{"text":"usr","index":0,"finish_reason":null,"logprobs":null}]}
data: {"id":"cmpl-6otEFIevL5BkdAusEEkkVjA6AIdTj","model":"cushman-ml","created":1677586719,"choices":[{"text":"/bin","index":0,"finish_reason":null,"logprobs":null}]}
data: {"id":"cmpl-6otEFIevL5BkdAusEEkkVjA6AIdTj","model":"cushman-ml","created":1677586719,"choices":[{"text":"/env","index":0,"finish_reason":null,"logprobs":null}]}
data: {"id":"cmpl-6otEFIevL5BkdAusEEkkVjA6AIdTj","model":"cushman-ml","created":1677586719,"choices":[{"text":" python","index":0,"finish_reason":null,"logprobs":null}]}
data: {"id":"cmpl-6otEFIevL5BkdAusEEkkVjA6AIdTj","model":"cushman-ml","created":1677586719,"choices":[{"text":"3","index":0,"finish_reason":"stop","logprobs":null}]}
data: [DONE]

You can see in the text parts that it suggests adding /usr/bin/env python3, which is exactly what I want in this case.

Also, shoutout to all hackers – this is the place you might like to call for fuzzing, prompt engineering, and other shenanigans.

There is a rate limit, though – I was able to get around 50 requests in before I got limited:

{
  "error":{
    "code":"429",
    "message":"Requests to the Create a completion from a chosen model Operation under OpenAI Language Model Instance API have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 2 seconds. Please contact Azure support service if you would like to further increase the default rate limit."
  }
}

After two seconds, I tried again and got rate-limited much quicker:

{
  "error":{
    "message":"rate limit exceeded",
    "internal_message":"rate limit exceeded for plan COPILOT_FOR_BUSINESS_SEAT",
    "type":"client_error"
  }
}

Unsurprisingly, they thought about that – and buying licenses for concurrency is kinda expensive.

Don’t forget to generate random UUIDv4 request IDs for the X-Request-Id header if you want to try it.

It’s an OpenAI API, so this reference is applicable.
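
If you want to replay a request yourself, here is a sketch – with two assumptions on my part: that the short-lived token from the /copilot_internal/v2/token call goes into a Bearer Authorization header, and that the payload can simply mirror the captures above:

import json
import uuid

import requests

COPILOT_TOKEN = "<token from /copilot_internal/v2/token>"  # see the Enrolling section

payload = {
    "prompt": "# Path: main.py\n#!",
    "suffix": 'if __name__ == "__main__":\n    print("Hello World!")\n',
    "max_tokens": 500,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stop": ["\n"],
    "stream": True,
    "extra": {"language": "python", "next_indent": 0, "trim_by_indentation": True},
}

resp = requests.post(
    "https://copilot-proxy.githubusercontent.com/v1/engines/copilot-codex/completions",
    headers={
        "Authorization": f"Bearer {COPILOT_TOKEN}",
        "X-Request-Id": str(uuid.uuid4()),  # the random UUIDv4 mentioned above
    },
    json=payload,
    stream=True,
)

# The response is a server-sent-event stream of "data: {...}" lines.
for line in resp.iter_lines():
    if line.startswith(b"data: ") and line != b"data: [DONE]":
        chunk = json.loads(line[len(b"data: "):])
        print(chunk["choices"][0]["text"], end="")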

A Quick Dip Into Prompt Engineering

Let’s take a look at the prompt in more detail. Of course, there are other inputs in the completion request, such as the suffix or the language used, but I’d like to focus on the prompt itself.

Also, there are more resources on this topic, such as this or even this. But still, I’d like to mess around a bit with it.

A simple prompt might look like this (it really is the raw file contents):

import requests
import os
import loggi

This is the content of the file that the Copilot sees.

Sometimes, the agent decides that the path is important, and we can see this directive as a “comment” (I’ll unescape whitespace for better readability):

# Path: main.py
#!/usr/bin/env python3
import requests
import os
import logging
i

The agent has the ability to add context to this prompt, which is even more visible when working with more complex projects.

I started a simple Django app, opened views.py and urls.py in two tabs of Neovim, and observed this prompt:

# Path: urls.py
# Compare this snippet from views.py:
# from django.shortcuts import render
# from django.http import HttpResponse
# 
# # Create your views here.
# def index(request):
#     return HttpResponse("Hello, world. You're at the index.")
# 
# def hello_Copilot(request):
#     return HttpResponse("Just so you know, this was written by Copilot.")
# 
# def hello_custom(request):
#     return HttpResponse("This was written by a human. NOT :D :D :D ")
# 
from django.urls import path

from . import views

urlpatterns = [
    path('', views.index, name='index'),
    path('Copilot/', views.hello_Copilot, name='Copilot'),
    path("

Since Copilot had the context it needed, it was able to suggest the correct path to enter into the urlpatterns variable in the other file.

I’ve tried to point it to an HTML template, but to no avail. So I manually crafted this prompt:

# Path: views.py
#Compare this snippet from template.html:
#{{ custom_content }}
#
from django.shortcuts import render
from django.http import HttpResponse
from django.shortcuts import render

import os

# Create your views here.
def index(request):
    return HttpResponse("Hello, world. You're at the index.")

def hello_Copilot(request):
    return HttpResponse("Just so you know, this was written by Copilot.")

def hello_custom(request):
    return HttpResponse("This was written by a human. NOT :D :D :D ")

def secret(request):
    with open("config.json") as f:
        data = json.load(f)
    return HttpResponse(data["token"])

def envvar(request):
    return HttpResponse(os.environ["DB_TOKEN"])

def context(request):
    context = {

I’ve got this recommendation (parsed out of the response):

"custom_content": "This is a custom content"

So it seems that the AI is capable of contextually interpreting and combining different languages, but the agent seems to be the limiting factor here.

It is also able to do much more than source code – I tried asking for a haiku with the prompt:

Please write a haiku in a comment

And I received (it went on and on):

below. It can be about anything you want, but it must be 5 syllables, 7 syllables, 5 syllables. You can use the syllable counter at http://www.haiku-generator.com/ to help you. You can also use the syllable counter in the syllable counter app in the app store. You can also use the syllable counter in the syllable counter app in the app store. You can also use the syllable counter in the syllable counter app in the app store. You can also use the syllable counter in the syllable counter app in the app store. You can also use the syllable counter in the syllable counter...

It seems that the AI was trying to continue my comment – while the “below” at the start of the sentence looks stylistically ugly, I’d still like to keep it (for science!). But back to the topic of secure configuration and secret leaks.

By Default, Almost Everything Is Sent for Analysis

It seems that the calls are made only after you switch Neovim to insert mode and type at least something – this behavior might differ between the plugins and IDEs.

That being said, the context is quite generous. When I tried to modify an SSH config file on the testing machine, this is what went out:

{
  "prompt": "\n# This is the ssh client system-wide configuration file.  See\n# ssh_config(5) for more information.  This file provides defaults for\n# users, and the values can be changed in per-user configuration files\n# or on the command line.\n\n# Configuration data is parsed as follows:\n#  1. command line options\n#  2. user-specific file\n#  3. system-wide file\n# Any configuration value is only changed the first time it is set.\n# Thus, host-specific definitions should be at the beginning of the\n# configuration file, and defaults at the end.\n\n# Site-wide defaults for some commonly used options.  For a comprehensive\n# list of available options, their meanings and defaults, please see the\n# ssh_config(5) man page.\n\nInclude /etc/ssh/ssh_config.d/*.conf\n",
  "suffix": "Host *\n#   ForwardAgent no\n#   ForwardX11 no\n#   ForwardX11Trusted yes\n#   PasswordAuthentication yes\n#   HostbasedAuthentication no\n#   GSSAPIAuthentication no\n#   GSSAPIDelegateCredentials no\n#   GSSAPIKeyExchange no\n#   GSSAPITrustDNS no\n#   BatchMode no\n#   CheckHostIP yes\n#   AddressFamily any\n#   ConnectTimeout 0\n#   StrictHostKeyChecking ask\n#   IdentityFile ~/.ssh/id_rsa\n#   IdentityFile ~/.ssh/id_dsa\n#   IdentityFile ~/.ssh/id_ecdsa\n#   IdentityFile ~/.ssh/id_ed25519\n#   Port 22\n#   Ciphers aes128-ctr,aes192-ctr,aes256-ctr,aes128-cbc,3des-cbc\n#   MACs hmac-md5,hmac-sha1,umac-64@openssh.com\n#   EscapeChar ~\n#   Tunnel no\n#   TunnelDevice any:any\n#   PermitLocalCommand no\n#   VisualHostKey no\n#   ProxyCommand ssh -q -W %h:%p gateway.example.com\n#   RekeyLimit 1G 1h\n#   UserKnownHostsFile ~/.ssh/known_hosts.d/%k\n    SendEnv LANG LC_*\n    HashKnownHosts yes\n    GSSAPIAuthentication yes\n",
  "max_tokens": 500,
  "temperature": 0,
  "top_p": 1,
  "n": 1,
  "stop": [
    "\n\n\n"
  ],
  "stream": true,
  "extra": {
    "language": "sshconfig",
    "next_indent": 0,
    "trim_by_indentation": true
  }
}

Surprisingly (to me), this didn’t trigger when I tried to modify a private key. It has absolutely no issues sending the contents of a .env file, though:

{
  "prompt": "# Path: .env\nGITHUB_TOKEN=ghp_kjasodnfavoeyhbanvduiadf\n",
  "suffix": "",
  "max_tokens": 500,
  "temperature": 0,
  "top_p": 1,
  "n": 1,
  "stop": [
    "\n\n\n"
  ],
  "stream": true,
  "extra": {
    "language": "shellscript",
    "next_indent": 0,
    "trim_by_indentation": true
  }
}

It also sent some data about a GCP service account in a JSON file when I tried to modify it.

To me, this confirms the suspicion that Copilot sends (nearly) all of the data for inspection, and it could pick up the secrets as well. The default configuration for the plugins is very permissive, and it tries to “help” you in all of the available contexts.

Copilot Can Be Selectively Restricted by File Types

The issue above could be solved (or at least mitigated) if we just didn’t send some files over. You should also separate config from code (yeah, right), so in theory, this should be enough. Just watch out for hardcoded secrets.

Fortunately, it seems that the plugins enable this configuration.

Also, if you recall the “A Quick Dip Into Prompt Engineering” part, the restriction holds for multi-file editing too: from my observations, Copilot doesn’t look for hints in files of types that it shouldn’t touch.

Vim/Neovim

From the doc, it’s the g:copilot_filetypes option.

This seems to be based on Vim’s syntax stuff – here should be a list of supported languages.
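
Following the TL;DR recommendation (disallow everything, allow only what you use), a sketch for init.vim – the allow-list below is just an example:

" Deny all filetypes by default, then re-enable the languages in use.
let g:copilot_filetypes = {
    \ '*': v:false,
    \ 'python': v:true,
    \ 'go': v:true,
    \ }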

And indeed, I am not seeing any more requests from my env files (I don’t know what to include here to showcase this – as the requests are not created).

VSCode

This option also exists for VSCode (ripped out of the extension manifest):

"github.copilot.enable": {
    "type": "object",
    "default": {
        "*": true,
        "yaml": false,
        "plaintext": false,
        "markdown": false
    },
    "markdownDescription": "Enable or disable Copilot for specified [languages](https://code.visualstudio.com/docs/languages/identifiers)"
},
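
That snippet shows the extension’s defaults. To get the deny-all setup, you can override it in your user settings.json – again, the allowed languages are just an example:

"github.copilot.enable": {
    "*": false,
    "python": true,
    "go": true
}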

Visual Studio

It’s the same as in the VSCode case; the plugins for these two IDEs are strikingly similar.

JetBrains

Yes, in PyCharm you can access it at File -> Settings -> Languages & Frameworks -> GitHub Copilot.

For some reason, you can only disable selected languages. To get the proper setup of disabling everything except the required languages, you have to list them all, which is slightly more annoying than in the above cases.

Enforcing the Settings

Let’s say that you are responsible for many different development machines and you are trying to set a policy.

For each of the plugins, I’d like to know if it’s on the machine, and I need to check/set the desired policy somehow. It’s almost like gathering IoCs.

VSCode

See this article for paths to extensions.

You are looking for a folder such as github.copilot-1.76.9071.

Then see this article to find the config location. We are looking for the user settings. Depending on your platform, the user settings file is located here:

Windows: %APPDATA%\Code\User\settings.json
macOS: $HOME/Library/Application Support/Code/User/settings.json
Linux: $HOME/.config/Code/User/settings.json
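
Here is a sketch of an automated check for the Linux case, assuming the default paths above and a plain-JSON settings file (VSCode itself tolerates comments in settings.json; this sketch does not):

import json
import sys
from pathlib import Path

# Default Linux locations – adjust per the platform table above.
EXT_DIR = Path.home() / ".vscode" / "extensions"
SETTINGS = Path.home() / ".config" / "Code" / "User" / "settings.json"

installed = any(p.name.startswith("github.copilot-") for p in EXT_DIR.glob("*"))
if not installed:
    sys.exit("Copilot extension not found - nothing to enforce")

if not SETTINGS.exists():
    sys.exit("no user settings.json found - defaults apply (Copilot enabled!)")

enable = json.loads(SETTINGS.read_text()).get("github.copilot.enable", {})
if enable.get("*", True):
    sys.exit("policy violation: Copilot is enabled by default ('*': true)")

print("OK, allowed languages:", sorted(k for k, v in enable.items() if v and k != "*"))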

Vim/Neovim

No sane Windows user would use these, so I am only describing Linuxes and macOSes, using the extension doc.

To check for extension presence, locate the copilot.vim directory, usually at one of these paths:

~/.vim/pack/github/start/copilot.vim
~/.config/nvim/pack/github/start/copilot.vim

Then you’ll need to find or set the config in a “vimrc” file, which should be one of these:

~/.config/nvim/init.vim
~/.vimrc

JetBrains

Refer to the docs.

For my system, it was at ~/.local/share/JetBrains/PyCharm2021.3/github-copilot-intellij, so I’d expect other products to follow a similar structure.

The config is then stored at $CONFIG_DIR/options/github-copilot.xml. For some reason, it’s XML, and for some other reason, it stores the disabled languages:

<application>
  <component name="github-copilot">
    <option name="signinNotificationShown" value="true" />
    <disabledLanguages>
      <option value="CouchbaseQuery" />
      <option value="ECMA Script Level 4" />
      <option value="GitIgnore" />
      <option value="HgIgnore" />
      ...
    </disabledLanguages>
  </component>
</application>

So you’ll probably need a slightly different way of looking for this one.
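
For example, a sketch that parses the XML and checks the deny-list – the config path is a guess based on my system, and the language names are whatever JetBrains calls them:

import xml.etree.ElementTree as ET
from pathlib import Path

# Config location guessed from my PyCharm install; other products differ.
CONFIG = Path.home() / ".config/JetBrains/PyCharm2021.3/options/github-copilot.xml"
ALLOWED = {"Python"}  # JetBrains' spelling of language names – an example allow-list

dl = ET.parse(CONFIG).getroot().find(".//disabledLanguages")
disabled = {o.get("value") for o in dl.findall("option")} if dl is not None else set()

# Since this config is a deny-list, we can only sanity-check it: the
# allow-listed languages must not be disabled, and the list should be long.
if ALLOWED & disabled:
    print("warning: allow-listed languages are disabled:", ALLOWED & disabled)
print(len(disabled), "languages disabled in", CONFIG)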

Visual Studio

Left as an exercise for the reader – I don’t want to spin up a Windows VM. I suspect it’ll be similar to the VSCode case.

Relevant docs here.

Closing Thoughts

If you managed to read this far, I sincerely thank you. To connect with the start of this article, our original issue was one of trust and least privilege.

A lot of folks are not comfortable with sending their secrets to third parties (and rightly so), even if the parties themselves look very trustworthy. In this case, the receivers of these secrets are Microsoft, GitHub, and OpenAI. The discussions about trust and law are complicated.

So folks like me are glad that there seems to be an option to dodge them completely, and just stop sending what we don’t want to send. Yes, you need to enforce it somehow on your machines, and yes, it won’t cover 100% of cases and so on. If you’d still like to do more about this problem, I’ve got another recommendation for you – enforce a “corporate proxy” and use some secret detection flows for the prompts that are sent.

I think the route of banning these tools is doomed to fail. The cat is out of the bag and everyone who can get an advantage utilizing AI will have, well, an advantage. So let’s use our new shiny toys properly, shall we?