Rewriting Human DNA Using A.I.

Oct 27, 2020

Photo by National Cancer Institute on Unsplash

Once the domain of gods, evolution, and only a small subset of humanity, writing and rewriting DNA is the one task that perhaps transcends all other biological technologies. It is a task dreamed of for eons. Those who came before us did not know the nature of the code in which they were written, but they knew a profoundly powerful force had written it. It is a task that promises great opportunity, great power, and perhaps even immortality itself.

I am delighted to announce the creation of the first A.I. capable of writing, and rewriting, human DNA. I created this A.I., tentatively titled SuperDNA, as a tool to explore the complex structure of DNA.

Humanity coded A.I., so it only seemed fitting that I teach A.I. to code humanity.

Put simply, this A.I. reads real human DNA, learns the patterns and structures within, and in so doing learns how to code out humanity itself. The patterns present in DNA correspond to the production of particular proteins. From those proteins we get the human organism. So in learning these patterns, the A.I. learns the most fundamental patterns of human biology.

Quite impressively, the SuperDNA A.I. learned about core structures and patterns in human DNA without any outside knowledge. That is, it seems to have picked up on how human proteins, and thus humanity itself, are encoded using nothing but the raw sequence.

A biologist might point to a given study or chemical test to help determine whether a given part, say ‘AACCGGTT’, of a much longer DNA sequence is important. The SuperDNA A.I. has no such reference material. It only looked at long lists of the DNA bases A, C, G, and T, and from those lists was able to determine important patterns. Imagine the possibilities if SuperDNA were integrated with such outside data.

Why should we have A.I. rewrite DNA?

It must be supposed that someday, whether in 10 years or 1,000, humans will try to edit large portions of the human genome simultaneously. Some humans have already been gene-edited. The problem is that humans are particularly bad at one class of scientific problem: complex systems, i.e. systems with a multitude of intertwined parts. With 3 billion base pairs, the human genome can very well be considered such a complex system.

Luckily for us, A.I. happens to be much better at dealing with complex systems. Perhaps one day A.I.-generated DNA will be used to cure genetic disorders far too complex for humans to figure out.

To perhaps hasten such a process, here I have the A.I. look at two sets of genes related to cancer: BRCA (1 & 2), which are related to breast cancer, and BCL (2, 6 & 10), which are related to lymphoma.

Here’s the more technical material.

Let's have a look at the probability distributions of the generated DNA samples as compared to the DNA they were created from.

Probability that a given DNA base appears in a sequence. gen1 to gen6 are A.I.-generated DNA samples. The pink bar is the real DNA that they were generated from.

From this simple distribution we can observe a rough correspondence in the probabilities: A and T tend to be more common than C and G. The A.I.-generated samples gen1 to gen4 are 20,000 characters in length. Sample gen5 is 50,000 characters and gen6 is 100,000. The real DNA sample is 631,330 characters. With samples this long, a shared skew is unlikely to be a chance fluctuation, so this distribution can be considered fairly non-random. I.e. the A.I. is picking up on the fact that A and T occur more often in the real DNA it is based on.
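A per-base distribution like the one described here could be computed along the following lines. This is a minimal sketch, not the SuperDNA code: the function name `base_frequencies` and the two toy sequences are assumptions made purely for illustration.

```python
from collections import Counter


def base_frequencies(seq: str) -> dict:
    """Return the fraction of A, C, G and T in seq.

    Symbols other than A, C, G, T (for example N) are ignored.
    """
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


# Toy sequences for illustration only; they are NOT taken from the SuperDNA data.
real_toy = "ATTAGCATTA"
generated_toy = "AATTCGATAT"

print(base_frequencies(real_toy))
print(base_frequencies(generated_toy))
```

Dividing by the total matters here because the samples differ in length (20,000 to 631,330 characters), so raw counts would not be comparable.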

While this is somewhat encouraging we would still like stronger evidence that the A.I. is actually picking up on patterns in the DNA. For this task we turn to combinatorial analysis, analyzing what patterns appear in both the real and A.I.-generated DNA.

In service of this goal we look for combinations of bases of length n. For example, for n = 2 we count how often AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, and TT appear. For n = 3 we look for AAA, AAC, AAG, AAT, ACA, … and so on, in principle up to n = the length of the DNA itself. The number of such combinations grows as 4^n. Obviously it gets quite messy to look at each data point, but we can still infer patterns by looking at the data overall.
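The enumeration and counting described here can be sketched in a few lines. The names `all_kmers` and `count_kmers` are hypothetical, and the choice to count overlapping windows is an assumption, since the article does not say whether windows overlap.

```python
from collections import Counter
from itertools import product


def all_kmers(n: int) -> list:
    """All 4**n strings of length n over A, C, G, T, in alphabetical order."""
    return ["".join(p) for p in product("ACGT", repeat=n)]


def count_kmers(seq: str, n: int) -> Counter:
    """Count every length-n window of seq.

    Assumption: windows overlap (the window slides one base at a time).
    """
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


print(len(all_kmers(2)))  # 16
print(len(all_kmers(5)))  # 1024

# Toy input for illustration only.
print(count_kmers("AACCGGTT", 2))
```

Listing the combinations in alphabetical order gives an x-axis that runs from AA…A to TT…T, which is the ordering the plots in this article use.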

Let me now present the n = 5 combinatorial structure for such sequences, i.e. the number of times AAAAA, AAAAC, …, TTTTG, TTTTT appear (for those wondering, there are 4⁵ = 1024 such combinations). Here is the plot for our real DNA dataset:

n=5 combinatorial plot for our real DNA. The x-axis is labelled AAAAA, AAAAC, …., TTTTG, TTTTT. There are just so many data points that it is impossible to read each one.

Notice how many data points there are. We can also see that there are a lot of the combo TTTTT, which is represented by the spike at the end. Now let us compare this to similar plots for our A.I.-generated DNA:

n=5 combinatorial plots for our real DNA compared to the A.I.-generated DNA. Notice the outlined large dip. Other dips also appear in common locations, none as large as the one highlighted. Some differences also appear. Notably 2 and 3 have larger AAAAA occurrence, which is responsible for the spike at the beginning of those plots.

Notice the similarities in structure highlighted in the comparison above. A large dip in values occurs roughly around the CG*** combination locations. This dip is even structured similarly in each graph:

Note how 2 major cavities are contained in the red box. One before a spike, one wider one after a spike.

Notably, a spike separates the two valleys of this large dip. Also notable is the fact that this spike is slightly different in each graph, although it has the same overall structure in every graph.
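One way to put a number on the dip around the CG*** combinations is to measure what share of all 5-base windows start with CG. The sketch below is an assumption about how one might do that, not the method behind the plots: `prefix_share` is a hypothetical helper and the windows are assumed to overlap. As a baseline, if all 1024 combinations were equally likely, the 64 that start with CG would make up 64/1024 = 1/16 of the windows.

```python
def prefix_share(seq: str, prefix: str = "CG", n: int = 5) -> float:
    """Fraction of all overlapping length-n windows of seq that start with prefix."""
    seq = seq.upper()
    windows = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    if not windows:
        raise ValueError("sequence is shorter than n")
    hits = sum(1 for w in windows if w.startswith(prefix))
    return hits / len(windows)


# Share expected if every 5-base combination were equally likely.
UNIFORM_BASELINE = 4 ** 3 / 4 ** 5  # 64 / 1024 = 0.0625

# Toy sequence for illustration only; it is NOT taken from the SuperDNA data.
toy = "CGAAACGTTT"
print(prefix_share(toy), UNIFORM_BASELINE)
```

Running the same function on the real sample and on each generated sample would show whether they fall below the baseline by a similar margin.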

This means that the A.I. is not only reproducing individual bases at rates similar to the real DNA, but that its output actually resembles the original DNA structurally. That is, if a given pattern such as GGGTT appears in the real DNA, we can expect our A.I. to generate the same or a similar pattern at a relatively similar rate (effectively an efficient compression of the original's statistics).
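The visual comparison could be complemented with a single similarity score between two n = 5 profiles. The sketch below uses cosine similarity, which is my choice for illustration and not something the article reports; `kmer_profile` and `cosine_similarity` are hypothetical names, and overlapping windows are assumed.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq: str, n: int = 5) -> list:
    """Relative frequency of each of the 4**n combinations, in alphabetical order.

    Windows containing symbols other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=n)]
    total = sum(counts[k] for k in kmers)
    if total == 0:
        raise ValueError("no valid length-n windows in sequence")
    return [counts[k] / total for k in kmers]


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


# Toy sequences for illustration only; they are NOT taken from the SuperDNA data.
real_toy = "ACGTACGTTTGACCA"
generated_toy = "ACGTTTGACGTACCA"
print(cosine_similarity(kmer_profile(real_toy, 2), kmer_profile(generated_toy, 2)))
```

A score near 1 means the two samples use the combinations in similar proportions, and a score of 0 means they share none. Working with relative frequencies keeps the comparison fair between a 20,000-character sample and the 631,330-character original.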

So our A.I. learned the patterns present in DNA. Patterns in DNA correspond to things produced by that DNA. So the A.I. has learned at least a little bit about the functions of life which that DNA encodes.

Most of the code and all of the data will be posted on my GitHub; follow me there for code updates, or follow me on Twitter at @GDurendal.

GeorgeDavila/SuperDNA

SuperDNA github repository for code and data related to the SuperDNA A.I.

github.com

Written by George Davila Durendal