Photo by Austin Johnson on Unsplash

A project playing with open data from SETI

They say that the best way to learn data science is to create something.

Once you’ve covered the basics of data manipulation, coding and statistics, using textbooks, blog posts and, of course, MOOCs (Massive Open Online Courses), the thing to do is to work on a project that interests you. That way, you get to use what you’ve learnt about various tools and techniques, and you get to do data science in a realistic and meaningful way: you have to actually find the data, get it ready for analysis and, most importantly of all, come up with the questions to ask.

For me, this is where I found myself recently, having spent several years working through countless online courses. I’d reached a temporary state of MOOC fatigue and wanted to work on something long-term and in-depth. I began to look around for interesting data and eventually stumbled across various files and GitHub repositories from the SETI Institute (the Search for Extraterrestrial Intelligence).

Initially, it looked like all the data and code I needed to get started was available, including some big datasets hosted by IBM, plus some analysis code. I then realised that quite a bit of it came from projects that had since been discontinued, leaving me with many bits and pieces, but nothing concrete.

Thankfully, after a few emails, the folks over at SETI were keen to help me out and made it clear that engaging ‘citizen scientists’ was something they wanted to do more of in the future. I was added to their Slack channel, had a Skype call, and was given several links to additional datasets. It was an amazing and refreshing approach to reaching out to data enthusiasts, and one I’d never encountered before.

Enthused, I then went on to review all of SETI’s public engagement projects, both past and present, in order to find my project idea.

SETI and Citizen Science

In January 2016, the Berkeley SETI Research Center at the University of California, Berkeley started a program called Breakthrough Listen, described as “the most comprehensive search for alien communications to date”. Radio data is currently being collected by the Green Bank Observatory in West Virginia and the Parkes Observatory in New South Wales, with optical data being collected by the Automated Planet Finder in California.

To engage the public, Breakthrough Listen’s primary method is SETI@home, a program that can be downloaded and installed so that your PC, when idle, downloads packets of data and runs various analyses on them.

Beyond this, they have shared a number of starter scripts and some data. The scripts can be found on GitHub here, and a data archive can be found here (although most of this is in the ‘baseband’ format, which is a rawer format compared to the ‘filterbank’ format that I’ve been using). Note that the optical data from the Automated Planet Finder is also in a different format called a ‘FITS’ file.

The second initiative by SETI to engage the public was the SETI@IBMCloud project, launched in September 2016. This provided the public with access to an enormous amount of data via the IBM Cloud platform. This initiative, too, came with an excellent collection of starter scripts which can still be found on GitHub here. Unfortunately, at the time of writing, this project is on hold and the data cannot be accessed.

SETI’s use of Deep Learning

There are a few other sources of SETI data online. In the summer of 2017, SETI hosted a machine learning challenge in which simulated datasets of various sizes were provided to participants, along with a blinded test set. The winning team achieved a classification accuracy of 94.7% using a convolutional neural network. The aim of the challenge was to attempt a novel approach to signal detection: to go beyond traditional signal analysis by converting the signals into spectrograms and turning the problem into an image classification task.

A squiggly signal in simulated SETI data

The primary training data has been removed, but ‘basic’, ‘small’ and ‘medium’ versions of the data are still on the GitHub page. Details of the nature of these signals, along with a more detailed description of the challenge, can be found here.

Note that many of the scripts written by SETI and hosted on GitHub use a non-standard Python package called ibmseti.

SETI’s work in machine learning recently hit the headlines when a deep learning algorithm was applied to a gargantuan chunk of data from the Green Bank telescope relating to a radio source called FRB 121102. A number of mysterious signals were discovered, thought to originate from a dwarf galaxy 3 billion light years away, sending the media into an alien-fuelled frenzy. However, some of the details of these signals, such as their polarisation, suggest they’ve passed through an extremely powerful magnetic field, leading to the hypothesis that they originated from a neutron star, perhaps near a massive black hole.

Further details can be found in the paper Zhang, Yunfan Gerry, et al. “Fast radio burst 121102 pulse detection and periodicity: a machine learning approach.” The Astrophysical Journal 866.2 (2018): 149.

All of the data from this work and a link to the corresponding paper can be found here.

The ABACAD Approach to finding ET

Some SETI searches look for very fast signals, i.e. signals that exist over a very short time period, across a wide frequency range; the fast radio burst paper above uses such data. Other searches do the opposite, looking for signals in very narrow frequency windows over longer time-frames. This sort of data interested me more from the outset, as it felt more likely to contain any sort of purposeful alien signal.

I then came across the following paper: Enriquez, J. Emilio, et al. “The breakthrough listen search for intelligent life: 1.1–1.9 GHz observations of 692 nearby stars.” The Astrophysical Journal 849.2 (2017): 104.

In it, such data from 692 stars is analysed using non-machine-learning techniques, with much of the underlying data shared (I’ll refer to this work from now on as the ‘Breakthrough 692’ paper/project). One technique used is known as an ‘ABACAD’ approach. The thinking is as follows: data is collected from a target star (the first ‘A’), then the telescope is moved to a different target (‘B’). Then back to A, then another target, C, once more back to A, and finally a different target, D. The idea is that if a signal is coming from the primary target star, it will be present in all 3 A scans, whereas if a signal is terrestrial in origin, it’s likely to show up in all 6 scans.

This paper led me to the idea of using such data with machine-learning. I had found my project!

My Project

Initially, I began playing with the simulated data from the 2017 SETI summer machine-learning challenge. At first I did so locally (i.e. using my home desktop PC), and then soon moved over to working on Kaggle, thanks to their free data hosting and GPU support. I created a notebook (known as a ‘kernel’ on Kaggle) that introduced typical SETI data and the filterbank file format, followed by one using deep-learning to distinguish between different types of simulated data (as per the summer challenge).

I then moved on to the Breakthrough 692 data and decided to try using cloud computing for this, as I knew that churning through the massive filterbank files generated in the ABACAD searches would benefit from the scaling abilities offered by cloud platforms. Unfortunately, I knew little about the subject, and so placed this project on hold until I’d completed the excellent Data Engineering on Google Cloud Platform specialisation on Coursera.

Once done, I began to put the code together on GCP Datalab (Google’s cloud version of Jupyter Notebooks). I broke the problem down into 4 sections,

  1. Creating spectrogram images — Converting the filterbank data into spectrogram image files
  2. Simulating the data — This had to be similar to the sort of data coming out of the ABACAD searches, which was quite different to the simulated data from the summer challenge
  3. Building a deep learning model — Using the simulated data to create and assess a model
  4. Making predictions from ABACAD filterbank files — Using the model to spot signals, not just per image, but also in the context of the ABACAD scanning technique
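As a rough sketch of step 1 (my own illustrative code, not the project’s exact implementation), a narrow frequency window can be cut from a filterbank array and normalised into an 8-bit greyscale image, assuming the file has already been read into a 2D time × frequency array (e.g. with Breakthrough Listen’s blimpy package). The window width here is arbitrary:

```python
import numpy as np

def slice_to_image(data, start, width=197):
    """Cut a narrow frequency window from a (time x frequency) array and
    normalise it to an 8-bit greyscale image array."""
    window = data[:, start:start + width]
    lo, hi = window.min(), window.max()
    if hi == lo:                                 # flat window: return all zeros
        return np.zeros_like(window, dtype=np.uint8)
    scaled = (window - lo) / (hi - lo)           # rescale to 0..1
    return (scaled * 255).astype(np.uint8)       # save with e.g. PIL.Image.fromarray

# Example on a fake filterbank block of 16 time samples x 1000 channels
fake = np.random.default_rng(0).normal(size=(16, 1000))
img = slice_to_image(fake, start=0)
print(img.shape, img.dtype)  # (16, 197) uint8
```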

Simulating the data

To simulate the data, I came up with a number of categories (partly based upon the summer challenge categories and partly based upon what I could see in the Breakthrough 692 results). These were: noise, line, chopped line, wibble and curve. I tried to ensure that the signal levels and background noise levels, as well as the number of pixels, matched the sort of data seen in the Breakthrough 692 project. Some examples of each are below,

Line

Wibble

Chopped line

Curve

Noise

The code has a list of variables that can each take any value from a range. These include numerical variables such as the background noise level, signal strength, gradient of the line, size of the gaps in the chopped line, etc. These are chosen randomly each time, allowing variations of each image class to be created in bulk. There are also a few more experimental variables, such as one called ‘ghost’, where a copy of the signal is duplicated to the left and right of the main signal. However, I’ve not yet explored its impact.
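A minimal sketch of the ‘line’ class gives the flavour (the function name and parameter ranges are my own, not taken from the project code): Gaussian background noise with a narrowband signal drifting linearly across frequency.

```python
import numpy as np

def simulate_line(height=197, width=197, snr=10.0, drift=0.3, seed=None):
    """Generate one 'line' class image: Gaussian background noise plus a
    narrowband signal whose centre frequency drifts linearly over time.
    In practice snr, drift and the start channel would all be randomised."""
    rng = np.random.default_rng(seed)
    img = rng.normal(0.0, 1.0, size=(height, width))   # background noise
    start = rng.integers(width // 4, 3 * width // 4)   # random starting channel
    for t in range(height):
        f = int(round(start + drift * t))              # drifting centre frequency
        if 0 <= f < width:
            img[t, f] += snr                           # inject the signal
    return img

line = simulate_line(seed=42)
print(line.shape)  # (197, 197)
```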

Building a deep learning model

For the deep learning aspects of this project, I initially used pre-built models such as VGG19 and InceptionNet. However, I later concluded that these were perhaps overly complicated for this application, so in the end I used the Keras framework to build a simple model architecture (inspired by a few Keras blog posts I’d read). The model I used is summarised below,

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 188, 188, 32) 9632
_________________________________________________________________
activation_1 (Activation) (None, 188, 188, 32) 0
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 94, 94, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 94, 94, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 90, 90, 32) 25632
_________________________________________________________________
activation_2 (Activation) (None, 90, 90, 32) 0
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 45, 45, 32) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 45, 45, 32) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 43, 43, 64) 18496
_________________________________________________________________
activation_3 (Activation) (None, 43, 43, 64) 0
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 21, 21, 64) 0
_________________________________________________________________
dropout_3 (Dropout) (None, 21, 21, 64) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 28224) 0
_________________________________________________________________
dense_1 (Dense) (None, 197) 5560325
_________________________________________________________________
dropout_4 (Dropout) (None, 197) 0
_________________________________________________________________
dense_2 (Dense) (None, 5) 990
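The layer shapes and parameter counts above pin down most of the architecture, so it can be reconstructed in Keras roughly as follows. This is a reconstruction, not the original code: the kernel sizes are inferred from the parameter counts, while the dropout rates and activations are guesses, as the summary doesn’t show them.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Kernel sizes inferred from the Param # column:
# 9632 -> 10x10 on 3 channels, 25632 -> 5x5 on 32, 18496 -> 3x3 on 32
model = keras.Sequential([
    keras.Input(shape=(197, 197, 3)),
    layers.Conv2D(32, (10, 10)),
    layers.Activation('relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),                   # rates not shown in the summary
    layers.Conv2D(32, (5, 5)),
    layers.Activation('relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(64, (3, 3)),
    layers.Activation('relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(197),
    layers.Dropout(0.5),
    layers.Dense(5, activation='softmax'),  # one output per signal class
])
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
print(model.count_params())  # 5615075, the sum of the Param # column
```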

I then trained this using the RMSprop optimiser over 100 epochs, which gave the following accuracy and loss plots,

This achieved an accuracy of ~84% on the test data. Below is the confusion matrix from the test phase,

Single Image Predictions

Once the model was created, I needed some images to test. I took some from the Breakthrough 692 paper to see what the model would make of them. Below are a couple of images from HIP4436, the first being noise,

And here is the prediction from the model,

As you can see, it correctly classifies it as noise. Next, an image with a signal present,

And the prediction,

The class of ‘line’ is the clear winner.

Making predictions from ABACAD filterbank files

Taking a slice from a filterbank file and classifying it is all well and good, but as mentioned previously, SETI use an ABACAD search technique to decide whether or not a target is of interest. To this end, I wrote a ‘scoring function’ (which I actually called ‘alien detector’ in the code, because that sounds cooler) that makes predictions on 6 images, corresponding to the same frequency range from 6 filterbank files, and then applies some simple rules, assigning points to decide whether or not the overall pattern is of interest. The scoring works as follows,

For example, 6 ‘noise’ results gets zero points. In fact, 6 lots of any of the same signal doesn’t score (for example, line-line-line-line-line-line). One signal type in one or more of the A scans with a different type in the B, C and D scans gets some points, and the maximum is assigned to a signal type in the 3 A scans, with B, C and D being noise (for example, wibble-noise-wibble-noise-wibble-noise).
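The exact point values aren’t spelled out above, so the following is only an illustrative encoding of those rules, not the actual ‘alien detector’ code:

```python
def alien_detector(preds):
    """Score one ABACAD cadence given six class predictions in observation
    order [A1, B, A2, C, A3, D]. Point values here are illustrative."""
    a_scans = [preds[0], preds[2], preds[4]]
    off_scans = {preds[1], preds[3], preds[5]}
    score = 0
    for sig in a_scans:
        if sig != 'noise' and sig not in off_scans:
            score += 1                    # on-target signal absent off-target
    if score == 3 and off_scans == {'noise'}:
        score += 3                        # cleanest case: signal only in the A scans
    return score

print(alien_detector(['wibble', 'noise', 'wibble', 'noise', 'wibble', 'noise']))  # 6
print(alien_detector(['line'] * 6))                                               # 0
print(alien_detector(['noise'] * 6))                                              # 0
```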

This seems to work quite well. For example, in the Breakthrough 692 paper, the following was flagged as a possible candidate (HIP65352) using non-ML signal detection methods,

Using deep learning, this gets a low score, because the faint line in the B scan (the second one from the bottom), which you can just make out by eye, is detected.

Returning to the HIP4436 example used earlier, we have the following collection of images,

With the deep learning model giving the following predictions: ‘line’, ‘noise’, ‘line’, ‘noise’, ‘line’, ‘noise’, as expected. This receives the maximum score.

New Data

The examples above use small data files shared from the Breakthrough 692 paper. However, SETI openly share huge swathes of unprocessed data with the public. The six filterbank files needed to analyse a single target amount to an eye-watering 100GB. I took these files for HIP4436 and converted the first 3000 frequency slices from each of the 6 files into PNG image files.

These were then sent through the deep learning model, then through the scoring function, and finally the results were saved to a csv file. This resulted in a tiny and easily usable file for such a small number of frequencies, but if the full filterbank range were processed, the resulting csv file would have half a million entries (although you could always save only the ‘hits’, which would be far fewer).
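The final step might look something like this (the column names and example scores are hypothetical, not taken from the project’s actual output):

```python
import csv
import os
import tempfile

# Hypothetical per-window results: (target, first frequency channel, cadence score)
results = [
    ('HIP4436', 0, 6),
    ('HIP4436', 197, 0),
    ('HIP4436', 394, 1),
]

path = os.path.join(tempfile.gettempdir(), 'seti_results.csv')
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['target', 'start_channel', 'score'])
    writer.writerows(row for row in results if row[2] > 0)  # keep only the 'hits'
```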

To get the results into a more ‘cloud-friendly’ format, I tested out creating a BigQuery database from the results. This was very straightforward, and could be done as part of a processing pipeline to then allow interested parties to search for high scoring targets. For example,

SELECT *
FROM `seti_results.seti_results_table`
WHERE score > 0
ORDER BY score DESC

Gives the top scoring targets.

Overall process

The image below captures how this process could be placed into a simple pipeline, with the different streams staggered to show that processing could begin on the A1 data as soon as that scan is complete (with the results later combined with the other 5 streams once all are finished and fed into the scoring function).

These separate streams could even be handled by 6 different GCP instances, speeding things up considerably.

Time is Money

From what I’ve read, an issue with SETI data is its sheer scale. A single filterbank file is 17GB, and we’re dealing with 6 per target. At the frequency resolution of each individual image, each filterbank file generates 500,000 images. So, each target requires a staggering 3M images to be processed through a deep learning model.

I tested things on a GCP instance, specifically an n1-standard-8 (8 vCPUs, 30 GB memory) with 1x NVIDIA Tesla K80 GPU. Using this, I was able to make predictions on 2,999 images per scan, or 17,994 images in total, in ~187s. Based upon that, 3M predictions would take around 8 hours.
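Spelling the arithmetic out (using the ~$15-a-day instance cost mentioned below):

```python
images_per_run = 17_994            # 6 scans x 2,999 images each
seconds_per_run = 187
rate = images_per_run / seconds_per_run     # ~96 images per second
total_images = 6 * 500_000                  # one full ABACAD target
hours = total_images / rate / 3600
cost_per_day = 15.0                         # approx. n1-standard-8 + K80 price
cost = hours / 24 * cost_per_day
print(f'~{hours:.1f} hours, ~${cost:.2f} per target')  # ~8.7 hours, ~$5.41 per target
```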

We can see from the cost calculator below that if I left this instance on permanently, the cost would be ~$15 a day. Therefore, a full set of 6 filterbank files could be processed using this method for about $5.

And because this is cloud computing, the setup could be scaled down to save money, or scaled up to dramatically reduce the time taken to analyse a set of filterbank files, if higher costs could be tolerated.

Where next?

As I’m not a professional SETI researcher, there is no doubt that all aspects of the above could be improved, including,

  1. Data simulation — Improve the signal categories (signal shapes, signal levels, signal noise, etc). Include more than one signal in a given image (e.g. several lines rather than one). I initially generated an image class called ‘brightpixel’, used in the summer ML challenge. However, this seemed to play havoc with my models. This could be investigated further
  2. Deep learning model — Try different model configurations, different hyperparameters, different pre-processing methods, etc. Train on more data and over more epochs
  3. ABACAD scoring — Devise a better scoring system for deciding on the results of the ABACAD function
  4. Model deployment at scale — Use GCP to allow filterbank collections (six from the ABACAD search) to be automatically run through a deployed model
  5. Code optimisation — Speed up the predictions by making the code more efficient

Any advice or suggestions from the pros would be most welcome.

Code for this project can be found here.

Thanks to Emilio Enriquez, Andrew Siemion and Steve Croft for their assistance with this, and to SETI in general for its openness in letting citizen scientists play with such fascinating data.


Searching for ET using AI on GCP was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.