IP Geolocation Blog & How To Guides to Obtain IP Geolocation Data

Posted on February 17, 2019


IP Geolocation Analysis in Python Made Simple

In a recent blog post, I described how to find the geographical location of an IP address in Python by using GeoIPify, our reliable and competitively priced IP Geolocation service. The power of Python largely comes from its packages. It is thus very simple to stand on the shoulders of giants: by importing some packages, you can access sophisticated algorithms and solutions through a few simple commands. This way, you can very easily orchestrate almost anything a computer can do.

Here I present some ideas through sample code which you can easily try yourself. Although the code itself is simple, it uses packages built on a powerful arsenal of geography and mathematics. The results are not only spectacular, but also quite handy, e.g. in geomarketing, cybersecurity and various other fields. Let's now see the details.

The question

Assume that you have a (possibly large) set of different IP addresses as an input. With GeoIP, you can assign a geographical location to each. How and why do you analyze these data?

If you have a webpage of your enterprise, for instance, these data could come from your web server's access log. Analyzing the structure of the geographical locations you are visited from can be crucial in establishing your geomarketing strategy: identify regions showing an interest in your webpage, or the ones where you do not yet have the desired number of visitors.

In security forensics, the list of IPs can originate from the analysis of an attack. Naturally, you may be interested in the structure of the locations from which your system was attacked.

In any case, it is likely that you will want to see these locations as points on a map. First, we will demonstrate how easy it is to do. While the map is surely informative, you may also be interested in identifying sets of IPs which come from places close to each other, e.g. from the same region, in an algorithmic way. Our second illustration shall address this with the use of graph algorithms, which will be again appealingly simple and straightforward in Python.

Getting IPs and their geolocations

To get hold of a list of relevant IP addresses you may ask your web server administrator to provide you with a list of IPs from which your page was accessed in a given time period. To try our illustrative examples yourself, you need a text file with a single IP address in each line, and the IP addresses should be distinct. If you already have such a file, you can skip the next paragraph.

When creating this blog, my starting point was a daily access log file of one of the web servers of WhoisXML API, Inc. As I have read permissions to the logs on the server, the simplest way to do it from a Linux command-line was

                    
cat access.log | \
perl -nle'/(\d+\.\d+\.\d+\.\d+)/ && print $1' \
| sed '2 d'  | sort -u > ips.csv
                    
                

Here "access.log" is the actual server access log file, which can come from an Apache or an nginx server, and the desired file is generated as "ips.csv".
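If you prefer to stay in Python for this step as well, the extraction can be sketched roughly as follows. This is a minimal sketch, not a drop-in replacement for the command above: the function name "unique_ips" and the sample log lines are made up for illustration, and the pattern only looks for the first IPv4-looking token in each line without validating it.

```python
import re

# Matches the first IPv4-looking token on a line (no range check on the octets)
IPV4 = re.compile(r"(\d+\.\d+\.\d+\.\d+)")


def unique_ips(lines):
    """Return the sorted, de-duplicated first IPv4-like token of each line."""
    found = set()
    for line in lines:
        match = IPV4.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


# Hypothetical log lines in a common access-log layout
sample_log = [
    '203.0.113.7 - - [17/Feb/2019:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '198.51.100.23 - - [17/Feb/2019:10:00:05 +0000] "GET /about HTTP/1.1" 200 1024',
    '203.0.113.7 - - [17/Feb/2019:10:01:00 +0000] "GET /contact HTTP/1.1" 200 256',
]
ips = unique_ips(sample_log)
print(ips)
```

Passing an open log file instead of "sample_log" and writing the returned addresses one per line should give a file in the format expected below.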

Next, we need to collect the location of each of these IPs. The easiest way is to do it with the simple-geoip package as described in our other blog post. The following Python code snippet will do the job:

                    
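from simple_geoip import GeoIP

geoip = GeoIP("your_api_key")

sites = []
# ips.csv holds one IP address per line
with open("ips.csv", "r") as ipfile:
    for ip in ipfile:
        ip = ip.strip()
        try:
            data = geoip.lookup(ip)
            print(data)
            sites.append(data)
        except Exception:
            # Lookup failed: skip this IP and continue with the next one
            pass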
                    
                

Of course, you need to replace "your_api_key" with your actual API key.

The addresses come from the file "ips.csv". Each address is then looked up with geoip, and the resulting data are appended to a list of sites. A typical value of "data" will be:

                    
{'ip': '8.8.8.8', 'location': {'country': 'US', 'region': 'California', 'city': 'Mountain View', 'lat': 37.40599, 'lng': -122.078514, 'postalCode': '94043', 'timezone': '-07:00'}}
                    
                

Note that, to keep the code simple, I chose a rather crude means of exception handling: if it was not possible to get the data, the loop continues without appending anything to "sites". In a production situation you should distinguish between the different exceptions the geoip call can raise. In any case it is worth looking at "len(sites)" to check whether "sites" has about as many elements as there are IPs. The location may not be determined for a few IPs, but most of them should be there. A significant difference indicates a deeper problem, e.g. network connection issues or having run out of the lookups available in your subscription.
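One possible way to keep track of the failed lookups, instead of silently dropping them, is sketched below. The helper "lookup_all" and the stand-in "fake_lookup" are hypothetical names introduced only for this illustration; in real use you would pass "geoip.lookup" and narrow the "except" clause to the exceptions your client actually raises.

```python
def lookup_all(ips, lookup):
    """Look up every IP; return (results, failures) instead of hiding errors."""
    results, failures = [], {}
    for ip in ips:
        try:
            results.append(lookup(ip))
        except Exception as exc:  # narrow this to the exceptions your client raises
            failures[ip] = repr(exc)
    return results, failures


# Stand-in for geoip.lookup so that the sketch runs without an API key (hypothetical)
def fake_lookup(ip):
    if ip.startswith("10."):
        raise ValueError("no location for private address")
    return {"ip": ip, "location": {"lat": 0.0, "lng": 0.0}}


sample_ips = ["203.0.113.7", "10.0.0.1", "198.51.100.23"]
sites, failed = lookup_all(sample_ips, fake_lookup)
print(len(sites), "of", len(sample_ips), "resolved; failed:", sorted(failed))
```

Comparing the two counts, and glancing at the recorded error messages, makes it easier to tell a handful of unresolvable addresses from a systematic problem.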

By running the code above you can follow the lookup process as each result is displayed, and finally you will have "sites", a list of these dictionaries, ready to process. If you intend to analyze it with various approaches, consider saving this list, e.g. with pickle, and loading it in separate scripts. Alternatively, you may just continue this code snippet with the ones that follow in the next sections.
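Saving and reloading the list with pickle could look roughly like this. It is a minimal sketch: the one-element "sites" list merely mimics the shape of the lookup results, and the file name "sites.pickle" in the temporary directory is an arbitrary choice.

```python
import os
import pickle
import tempfile

# Sample data in the same shape as the lookup results (for illustration only)
sites = [{'ip': '8.8.8.8',
          'location': {'country': 'US', 'city': 'Mountain View',
                       'lat': 37.40599, 'lng': -122.078514}}]

# Any writable path works; the temporary directory is used here as an example
path = os.path.join(tempfile.gettempdir(), "sites.pickle")

# Save the list in the script that performs the lookups...
with open(path, "wb") as f:
    pickle.dump(sites, f)

# ...and load it again in a separate analysis script
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == sites)
```

Keep in mind that pickle files should only be loaded from sources you trust.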

Visualize IP geolocations on a map

Let's now see the sites on a map, and you don't need a professional degree in cartography to achieve this in Python. All you have to do is install the package "mpl_toolkits.basemap". It is slightly less straightforward than installing the usual packages because of its dependencies. On my Ubuntu Linux, "apt install python3-mpltoolkits.basemap" did the job perfectly. For other platforms I recommend taking a look at the package's installation guide or looking up specific information for your own platform. Altogether it is a free package which is not hard to install. You will also need the packages "matplotlib" and "numpy", which are dependencies of "basemap" anyway.

Having installed these dependencies, and having the previously prepared "sites" array at hand, use a code snippet that reads:

                    
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np

lats = [ s['location']['lat'] for s in sites ]
lons = [ s['location']['lng'] for s in sites ]

# How much to zoom from coordinates (in degrees)
zoom_scale = 5

# Setup the bounding box for the zoom and bounds of the map
bbox = [np.min(lats)-zoom_scale,np.max(lats)+zoom_scale,\
	np.min(lons)-zoom_scale,np.max(lons)+zoom_scale]

plt.figure()
# Define the projection, scale, the corners of the map, and the resolution.
m = Basemap(projection='merc',llcrnrlat=bbox[0],urcrnrlat=bbox[1],\
	llcrnrlon=bbox[2],urcrnrlon=bbox[3],resolution='i')

# Draw coastlines and fill continents and water with color
m.drawcoastlines()
m.fillcontinents(color='peru',lake_color='dodgerblue')
#We are also interested in countries...
m.drawcountries()

# build and plot coordinates onto map
x,y = m(lons,lats)
m.plot(x,y,'rx',markersize=5)
plt.title("My visitors")
plt.savefig('myvisitors.png', format='png', dpi=500)
plt.show()
                    
                

(In fact I have borrowed the idea from here.) The code is rather straightforward: we collect latitudes and longitudes, draw a basic map and put sufficiently large red crosses at the locations of the visitors. (In the format string 'rx', 'r' stands for red and 'x' stands for crosses, while "markersize=5" ensures that they are big enough.)

When pasted after the snippet of the previous section (or after unpickling "sites" from a file), this code will show the map and also save it into "myvisitors.png". The result looks like this:

[Image: Visualize IP geolocations on a map]

Of course, if you have any expertise in geographic projections and the like, you can make much more spectacular maps, but this one is already quite informative for a few lines of Python code. A natural extension would be to add visit frequency information and color the symbols of the sites according to the number of visitors, which would be a simple generalization of this code. But let us proceed to a mathematically more complex, yet informative investigation.
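As a starting point for such a generalization, the number of IPs per location can be counted with the standard library. This is only a sketch under assumptions: the three sample entries are made up, and since our input consists of distinct IPs, the counts are IPs per coordinate pair rather than true visit counts, which you would have to take from the access log.

```python
from collections import Counter

# Made-up sample in the same shape as the lookup results
sites = [
    {'ip': '192.0.2.1', 'location': {'lat': -23.55, 'lng': -46.63}},
    {'ip': '192.0.2.2', 'location': {'lat': -23.55, 'lng': -46.63}},
    {'ip': '192.0.2.3', 'location': {'lat': 37.41, 'lng': -122.08}},
]

# Count how many IPs share each (latitude, longitude) pair
counts = Counter((s['location']['lat'], s['location']['lng']) for s in sites)

# One entry per distinct location, with a weight for coloring or sizing its marker
points = sorted(counts)
lats = [lat for lat, lng in points]
lons = [lng for lat, lng in points]
weights = [counts[p] for p in points]
print(weights)
```

The "weights" list could then drive the marker color or size, for example through the "c" or "s" arguments of a matplotlib scatter call in place of the plain plot of crosses.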

Identify geographical regions of IPs

We want to automatically find groups of IP addresses whose geolocation is close to each other. By "close to each other" we mean that any two of them are not farther from each other than a certain distance limit. How do we find these groups? We get the answer from mathematics, graph theory in particular: we need to assign a graph to our data.

To have a graph, first we need a set of vertices, which will be the IP addresses in our case. In the graph, each pair of vertices is either connected or not. A connection between a pair of vertices is called an edge. In our case there will be an edge between two vertices if the respective IP geolocations are no farther from each other than the distance limit. So, for instance, when setting 300 miles as the distance limit, two IPs from Los Angeles, CA, and San Diego, CA are connected by an edge, as these cities are just 112 miles from each other. At the same time, one in Los Angeles will not be connected with another in Chicago, IL, as the distance between them is about 1750 miles.

Having set up this graph, what we are looking for is called "cliques" in mathematics. A clique is a subset of vertices in a graph in which every pair is connected by an edge. It is exactly what we are looking for: sets of locations in which each pair is within the distance limit. More precisely, we are looking for all the "maximal cliques": those which cannot be enlarged by adding further vertices to them. Luckily, mathematicians have put a lot of effort into algorithms searching for these. In addition, there is a Python library, "networkx", in which it is very simple to build a graph and subject it to any of the popular graph algorithms.
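To make the notion of a maximal clique more tangible, here is a small pure-Python sketch of the basic Bron-Kerbosch enumeration on a toy graph. It is meant purely as an illustration of the concept and is not the code we will rely on; the function name "maximal_cliques" and the four-vertex graph are made up for the example.

```python
def maximal_cliques(adjacency):
    """Yield every maximal clique of an undirected graph (basic Bron-Kerbosch).

    adjacency maps each vertex to the set of its neighbours.
    """
    def expand(clique, candidates, excluded):
        # Nothing left to add and nothing excluded: the clique is maximal
        if not candidates and not excluded:
            yield sorted(clique)
            return
        for v in list(candidates):
            # Extend the clique with v; only common neighbours remain eligible
            yield from expand(clique | {v},
                              candidates & adjacency[v],
                              excluded & adjacency[v])
            # v has been fully explored: move it from candidates to excluded
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from expand(set(), set(adjacency), set())


# Toy graph: A, B and C are mutually connected; D is connected only to C
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
cliques = sorted(maximal_cliques(toy))
print(cliques)  # [['A', 'B', 'C'], ['C', 'D']]
```

Note that vertex C belongs to both cliques: maximal cliques may overlap, and neither of the two can be enlarged without breaking the "every pair is connected" rule.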

Another question is how to calculate the distance between two places based on their latitudes and longitudes. Of course this has also been worked out by experts in geography, and a suitable function is available in the Python package "geopy". Both "geopy" and "networkx" can be installed with "pip".
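To get a feeling for what such a distance function does, the haversine formula below gives a rough great-circle distance on a spherical Earth. Note that this is a simpler approximation than the ellipsoidal calculation geopy offers, and that the function name "haversine_miles" as well as the approximate city-centre coordinates are my own assumptions for the illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean radius; the Earth is treated as a sphere


def haversine_miles(p1, p2):
    """Great-circle distance in miles between two (lat, lng) pairs in degrees."""
    lat1, lon1 = map(radians, p1)
    lat2, lon2 = map(radians, p2)
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Approximate city-centre coordinates (assumed values for illustration)
los_angeles = (34.0522, -118.2437)
san_diego = (32.7157, -117.1611)

d = haversine_miles(los_angeles, san_diego)
print(round(d), "miles")
```

The result should land close to the Los Angeles to San Diego distance quoted above, well within our 300-mile limit.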

Having installed them, let’s go for the task. Again, we assume that the array "sites" introduced above is at our disposal. The following code will provide a textual report on all the maximal cliques.

                    
import networkx as nx
import geopy.distance

sitesgraph = nx.Graph()

# Distance limit (miles)
distancelimit = 300.0

position = {}
location = {}
for site in sites:
    position[site['ip']] = (site['location']['lat'], site['location']['lng'])
    location[site['ip']] = "Country: %s, Region: %s, City: %s" % (
        site['location']['country'], site['location']['region'],
        site['location']['city'])
ips = list(position.keys())

sitesgraph.add_nodes_from(ips)

# Visit each pair of IPs once and connect those within the distance limit.
# Note: geopy 2.0 removed vincenty; with newer versions use geopy.distance.geodesic.
for k in range(len(ips)):
    for l in range(k):
        if geopy.distance.vincenty(position[ips[k]],
                                   position[ips[l]]).mi <= distancelimit:
            sitesgraph.add_edge(ips[k], ips[l])

# Report the members of each maximal clique
cliqueno = 1
for clique in nx.find_cliques(sitesgraph):
    print("--------------------------------")
    print("Clique No. %d\n\tMembers:" % (cliqueno))
    for ip in clique:
        print(location[ip])
    cliqueno += 1
                    
                

The main ideas are as follows. The "sitesgraph" is an instance of a networkx graph. The dictionaries "position" and "location" hold the latitude-longitude pairs and the country-region-city information by IP, respectively, whereas the list "ips" holds all the IPs. The call of "sitesgraph.add_nodes_from" adds all IPs as vertices to the graph. Then we loop through each pair of vertices (the limits of the two nested for loops ensure that each pair is only visited once, as the distance is symmetric). We evaluate their distance in miles by calling "geopy.distance.vincenty". (This function was removed in geopy 2.0, where "geopy.distance.geodesic" takes its place; look at geopy's documentation if you are interested in the ways of calculating the distance.) If the distance does not exceed our limit, we add the edge by invoking the "add_edge" method of the graph.

Our graph has been set up, so now let's search for the maximal cliques. Enumerating them is in fact a celebrated and challenging problem of mathematics, but for a graph with a few thousand vertices, networkx will do the job in a matter of a few minutes on a typical computer. This happens in the call to "nx.find_cliques(sitesgraph)", which simply serves as the iterator of our last loop. This loop provides us with a report listing the locations of all members of each clique.

In my data I get small cliques such as:

                    
--------------------------------
Clique No. 10
	Members:
Country: US, Region: Colorado, City: Telluride
Country: US, Region: Utah, City: Salt Lake City
Country: US, Region: Utah, City: Riverton
--------------------------------
Clique No. 11
	Members:
Country: US, Region: Colorado, City: Telluride
Country: US, Region: New Mexico, City: Albuquerque
Country: US, Region: New Mexico, City: Albuquerque
Country: US, Region: Arizona, City: Flagstaff
--------------------------------
                    
                

or bigger ones like this:

                    
Clique No. 55
	Members:
Country: BR, Region: Sao Paulo, City: São Paulo
Country: BR, Region: Sao Paulo, City: Americana
Country: BR, Region: Sao Paulo, City: Santos
Country: BR, Region: Sao Paulo, City: Santos
Country: BR, Region: Sao Paulo, City: São Paulo
Country: BR, Region: Sao Paulo, City: Lindoia
Country: BR, Region: Sao Paulo, City: São Paulo
Country: BR, Region: Sao Paulo, City: Fartura
Country: BR, Region: Sao Paulo, City: Matao
                    
                

Note that cliques may overlap, as we require any pair of sites to be close to each other within each clique. Nevertheless, the regions with bunching points that we could more or less see on the map of the previous section are now listed automatically and precisely. Depending on the limiting distance, we can find cliques characteristic of geographical regions of various sizes, from cities to continents. It's best to give it a try yourself: the results are really instructive.

Some other ideas

  • Orchestrating with other data sources;
  • Time series;
  • Neural networks.

We have seen two possible analyses of GeoIP data implemented with very simple Python code, yet relying on sophisticated tools from various expert areas. The results can be applied directly in tasks of significant practical importance.

There are also many other opportunities readily available in Python. We will list a few ideas without completeness, just for inspiration:

  • Collecting data for a longer time, you may analyze the dynamics of geolocations, e.g. with the Pandas data analysis library.
  • You may correlate these data with other types of data which can be derived from WHOIS (e.g. with Pandas or the "statistics" library).
  • You may search for various patterns in space or time in your data, or try making predictions for the future using machine learning with TensorFlow, etc.

What is worth doing with GeoIP data is probably doable in Python. And when using simple-geoip, mapping IPs to geolocations is almost trivial.
