ASN Check

  • By Miloslav Homer
  • Sun 19 November 2023
  • Updated on Sun 19 November 2023

Motivation

When investigating cyber attacks, you often look at IPs. More often than not, there are some connections between those, and more often than not, they have related AS numbers. While getting ASN info for a single IP is straigthforward, you need to exert effort for many addresses.

Surprisingly, I couldn't quickly find a tool for my use case: match many IP addresses to a few ASNs. Most of the tools would need to send a query to some server for each IP in the list. With thousands of IPs, that's infeasible, it's much easier to get the ranges and search offline.

So, I wrote a tool to help me. Of course, I used my cookiecutter template (see this post). You can install it from PyPI to check it out - pip3 install asn-check.

AS Numbers

An autonomous system (AS) is a collection of connected Internet Protocol (IP) routing prefixes under the control of one or more network operators on behalf of a single administrative entity or domain, that presents a common and clearly defined routing policy to the Internet. Each AS is assigned an autonomous system number (ASN), for use in Border Gateway Protocol (BGP) routing. (source: Wikipedia).

Intuitively, ASNs are usually the answer when you look with the naked eye at a lot of IPs and think: "C'mon, these surely belong together." This might happen if the attackers are using a single cloud provider or are coming at you from a single ISP (and therefore country - this is the valid conclusion for the origin of traffic, not the origin of attackers!).

IPs and Binary Trees

Subnets

For many of you, it is obvious to use binary trees when working with IPs, subnets, and such. The IP can be represented in binary like so:

192.168.50.102 = 11000000.10101000.00110010.01100110

The routing prefix may be expressed as the first address of a network, written in Classless Inter-Domain Routing (CIDR) notation, followed by a slash character (/), and ending with the bit-length of the prefix. For example, 198.51.100.0/24 is the prefix of the Internet Protocol version 4 network starting at the given address, having 24 bits allocated for the network prefix, and the remaining 8 bits reserved for host addressing (source: Wikipedia).

Visually, the IPs belonging to a 192.168.50.102/24 network can be expressed in binary:

192.168.50.102/24 = 11000000.10101000.00110010.????????

The prefix is fixed, and the suffix is variable.

Binary tree algorithm

To find out whether an IP belongs to a subnet, you follow the address bit by bit - if you reach the suffix, you know the IP belongs to this subnet, and you can stop searching. So, given a list of subnets, you can construct a binary tree (decision depends on the bit), and store the label of the subnet (AS number in our case) at the end of the prefix. When you reach a node with a label, return that label.

Implementation in this file.

Benchmark

This structure needs to be constructed, but it's memory efficient and fast. Of course, I have a pure python implementation. One can do better in C or Rust, but this is performant enough for my goals.

I've got a Dell Latitude with 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz.

I'll look at 1000000 random IPs generated at https://www.ipvoid.com/random-ip/.

$ wc -l 1M_ips.txt 
1000000 1M_ips.txt
$ asn_check --log-level INFO --input-file 1M_ips.txt --output-file output.txt
2023-11-19 16:48:21,970 [INFO] Starting
2023-11-19 16:48:21,970 [INFO] Get data
2023-11-19 16:48:21,970 [INFO] Getting ASN routes from https://thyme.apnic.net/current/data-raw-table
2023-11-19 16:48:24,084 [INFO] Getting ASN names from https://ftp.ripe.net/ripe/asnames/asn.txt
2023-11-19 16:48:24,186 [INFO] Parse data
2023-11-19 16:48:28,116 [INFO] Construct the tree..
2023-11-19 16:48:34,085 [INFO] Load addresses
2023-11-19 16:48:37,947 [INFO] Got 1000000 addresses
2023-11-19 16:48:37,947 [INFO] Searching...
2023-11-19 16:48:45,004 [INFO] Execution in 00:00:23.0344
2023-11-19 16:48:45,004 [INFO] Finishing

I am not doing breakthrough science, so I'll keep this brief.

  • It took only a short time to retrieve complete AS data as these are cached (which is more dependent on network than anything else).
  • approx 4s to parse these responses - this amount of time is static for each run.
  • approx 5s to create a binary tree with all of the ASN data - also constant time for each run.
  • approx 4s to load the IP addresses to memory (arguably, I should use a streaming approach here). I suspect the majority of time in this block is spent on ensuring that the address has the correct format (Regex FTW!).
  • approx 7s to classify 1M IP addresses + writing on disk in CSV format - the result speaks for itself.
  • 23s of total runtime makes for a very pleasant UX.

Profiling some more

Taking a closer look using the profiler (cProfile) for snippet:

with Profile() as profile:
    header = ["ip", "asn", "name", "country_code"]
    writer = csv.DictWriter(output_file, fieldnames=header)
    writer.writeheader()
    for addr in addresses:
        label = iptree.search(addr)
        meta = names.get(label, {"name": "", "country_code": ""})
        writer.writerow({"ip": addr, "asn": label, "name": meta["name"], "country_code": meta["country_code"]})
    Stats(profile).strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats()

And results:

22000025 function calls in 14.887 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000001    0.755    0.000    9.160    0.000 csv.py:153(writerow)
  1000001    2.646    0.000    7.088    0.000 {method 'writerow' of '_csv.writer' objects}
  1000000    4.756    0.000    5.392    0.000 ip_binary_tree.py:36(search)
  1000000    0.568    0.000    2.628    0.000 ipaddress.py:611(__str__)
  1000000    0.776    0.000    2.060    0.000 ipaddress.py:1232(_string_from_ip_int)
  5000005    1.393    0.000    1.814    0.000 csv.py:151(<genexpr>)
  1000001    1.165    0.000    1.317    0.000 csv.py:145(_dict_to_list)
  1000000    1.060    0.000    1.060    0.000 {method 'join' of 'str' objects}
  5000004    0.756    0.000    0.756    0.000 {method 'get' of 'dict' objects}
  1000000    0.261    0.000    0.261    0.000 {built-in method builtins.bin}
  1000000    0.225    0.000    0.225    0.000 {method 'to_bytes' of 'int' objects}
  1000000    0.188    0.000    0.188    0.000 {method 'rjust' of 'str' objects}
  1000000    0.188    0.000    0.188    0.000 ipaddress.py:576(__int__)
  1000001    0.152    0.000    0.152    0.000 {method 'keys' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 csv.py:141(writeheader)
        1    0.000    0.000    0.000    0.000 pstats.py:107(__init__)

There is some overhead included when profiling this many calls and what took 7.056s without a profiler now takes approx 14.887s.

As we can see, the cumulative time of the search function (line 3) is about 5.392s vs total time of 4.756s - these inner calls are all spent on converting the address to binary form.

Combining with the non-profiled run timings, we are ending with an estimate (tottime*non_profiled_run/profiled_run) that the 1M IP searches took 2.255s total.

Go download the tool, and generate some random addresses - your times might be different depending on your CPU and overall load, but the speed blew me away when I first tried it on a sizable input.

Conclusion (TL;DR:)

  • Go checkout the tool at GitHub or PyPI.
  • There are awesome datasets for AS numbers at APNIC and RIPE.
  • Binary trees are the way to work when checking IP assignment to network subnets - 2.255s for 1M IP addresses with a naive Python implementation, sheesh.

What a lovely weekend project, hope you'll find it helpful!