Rewriting Human DNA Using A.I.
Once the domain of gods, evolution, and only a small subset of humanity, writing and rewriting DNA is the one task that perhaps transcends all other biological technologies. It is a task dreamed of for eons. Those who came before us did not know the nature of the code in which they were written, but they knew a profoundly powerful force had written it. It is a task that promises great opportunity, great power, and perhaps even immortality itself.
I am delighted to announce the creation of the first A.I. capable of writing, and rewriting, human DNA. I created this A.I., tentatively titled SuperDNA, as a tool to explore the complex structure of DNA.
Humanity coded A.I., so it only seemed fitting that I teach A.I. to code humanity.
Put simply, this A.I. reads real human DNA, learns the patterns and structures within, and in so doing learns how to code out humanity itself. The patterns present in DNA correspond to the production of particular proteins. From those proteins we get the human organism. So in learning these patterns, the A.I. learns the most fundamental patterns of human biology.
Quite impressively, the SuperDNA A.I. learned about core structures and patterns in human DNA without any outside knowledge. That is, it seems to have learned how to encode human proteins, and thus humanity itself, from the raw sequence alone.
A biologist might point to a given study or chemical test to help determine whether a given part, say ‘AACCGGTT’, of a much longer DNA sequence is important. The SuperDNA A.I. has no such reference material. It only looked at long lists of the DNA bases A, C, G, and T, and from those lists was able to determine important patterns. Imagine the possibilities if SuperDNA were integrated with such outside data.
Why should we have A.I. rewrite DNA?
It must be supposed that someday, whether in 10 years or 1000, humans will try to edit large portions of the human genome simultaneously. Some humans have already been gene-edited today. The problem is that humans are particularly bad at one class of scientific problem: complex systems, i.e. systems with a multitude of intertwined parts. With roughly 3 billion base pairs, the human genome can very well be considered such a complex system.
Luckily for us, A.I. happens to be much better at dealing with complex systems. Perhaps one day A.I.-generated DNA will be used to cure genetic disorders far too complex for humans to figure out.
To perhaps hasten such a process, here I make the A.I. look at 2 sets of genes related to cancer: BRCA (1 & 2), which are related to breast cancer, and BCL (2, 6 & 10), which are related to lymphoma.
Here’s the more technical material.
Let's have a look at the probability distributions of the generated DNA samples as compared to the DNA they were created from.
From this simple distribution we can observe a rough correspondence in the probabilities: A and T tend to be more common than C and G. The A.I.-generated samples gen1 to gen4 are 20,000 characters in length. Sample gen5 is 50,000 characters and gen6 is 100,000. The real DNA sample is 631,330 characters. With samples this long, the shared skew is unlikely to be a random fluke, i.e. the A.I. is picking up on the fact that A and T occur more often in the real DNA it is based on.
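A distribution like this one takes only a few lines to compute. The sketch below is a minimal illustration, not the code used for SuperDNA: the function name `base_frequencies` and the toy sequence are made up for the example, and it assumes the DNA is available as a plain string of A, C, G, and T.

```python
from collections import Counter

BASES = "ACGT"

def base_frequencies(seq):
    """Return the fraction of each base (A, C, G, T) in a DNA string."""
    seq = seq.upper()
    # Ignore any character that is not one of the four bases (e.g. N or newlines).
    counts = Counter(ch for ch in seq if ch in BASES)
    total = sum(counts.values())
    if total == 0:
        return {b: 0.0 for b in BASES}
    return {b: counts[b] / total for b in BASES}

if __name__ == "__main__":
    toy = "AATTACGTAT"  # toy sequence for illustration, not real data
    for base, freq in base_frequencies(toy).items():
        print(base, round(freq, 2))
```

Running the same function on the real sample and on each generated sample gives the numbers needed for a side-by-side comparison.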
While this is somewhat encouraging, we would still like stronger evidence that the A.I. is actually picking up on patterns in the DNA. For this task we turn to combinatorial analysis: analyzing which patterns appear in both the real and the A.I.-generated DNA.
In service of this goal we look for combinations of bases of length n. For example, for n = 2 we look for how often AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, and TT appear. For n = 3 we look for AAA, AAC, AAG, AAT, ACA, … and so on, in principle up to n = the length of the DNA itself. The number of such combinations grows as 4^n. Looking at every data point quickly gets messy, but we can still infer patterns by looking at the data overall.
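The counting described above can be sketched as follows. This is an illustrative version rather than the actual SuperDNA analysis code: `kmer_counts` is a hypothetical name, and it assumes overlapping windows (the window slides one base at a time), which the text does not specify.

```python
from itertools import product

def kmer_counts(seq, n):
    """Count every length-n combination of A, C, G, T in seq."""
    # Start all 4**n combinations at zero so absent ones still appear in the result.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=n)}
    # Slide a window of width n across the sequence, one base at a time (assumption).
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if window in counts:  # skip windows containing other characters, e.g. N
            counts[window] += 1
    return counts

if __name__ == "__main__":
    counts = kmer_counts("AACGAA", 2)  # toy sequence, not real data
    print(len(counts))   # number of possible combinations for n = 2
    print(counts["AA"])  # how often AA occurs in the toy sequence
```

Because the dictionary is built from `itertools.product`, its size is exactly 4^n, which is why this approach becomes impractical for large n.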
Let me now present the n = 5 combinatorial structure for such sequences, i.e. the number of times AAAAA, AAAAC, …, TTTTG, TTTTT appear (for those wondering, there are 4⁵ = 1024 such combinations). Here is the plot for our real DNA dataset:
Notice how many data points there are. We can also see that there are a lot of the combo TTTTT, which is represented by the spike at the end. Now let us compare this to similar plots for our A.I.-generated DNA:
Notice the similarities in structure highlighted in the comparison above. A large dip in values occurs roughly around the CG*** combination locations. This dip is even structured similarly in each graph:
Notably, a spike separates the two valleys of this large dip. Also notable is the fact that this spike is slightly different in each graph, although it has the same overall structure in every graph.
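To see where a given combination sits on such a plot, it helps to turn it into a position. The sketch below assumes the combinations are laid out alphabetically with A < C < G < T, which is consistent with TTTTT sitting at the far end but is an assumption on my part; `kmer_index` is a name made up for this example.

```python
BASES = "ACGT"

def kmer_index(kmer):
    """Position of a combination when all combinations of its length are sorted alphabetically."""
    index = 0
    for ch in kmer:
        # Treat the string as a base-4 number with A=0, C=1, G=2, T=3.
        index = index * 4 + BASES.index(ch)
    return index

if __name__ == "__main__":
    print(kmer_index("AAAAA"))                       # 0
    print(kmer_index("CGAAA"), kmer_index("CGTTT"))  # 384 447
    print(kmer_index("TTTTT"))                       # 1023
```

Under that ordering, the 64 combinations of the form CG*** occupy positions 384 through 447 out of 1024, a little more than a third of the way along the axis.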
This means that the A.I. is not only reproducing bases in a similar fashion to the real DNA, but that its output actually resembles the original DNA structurally. That is, if a given pattern such as GGGTT appears in the real DNA, we can expect our A.I. to generate the same or a similar pattern at a relatively similar rate (effectively an efficient compression).
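The comparison in this article is visual, but "similar pattern at a similar rate" can also be expressed as a single number. The sketch below is one possible way to do that, using a Pearson correlation between the relative rates of each combination. The choice of metric and the names `kmer_rates` and `pearson` are mine for illustration, not part of the SuperDNA analysis.

```python
from itertools import product
from math import sqrt

def kmer_rates(seq, n):
    """Relative rate of every length-n combination, in alphabetical order."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=n)}
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid dividing by zero on very short input
    return [counts[k] / total for k in sorted(counts)]

def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

if __name__ == "__main__":
    real = "AACGTTAACGTTTTTA"  # toy stand-ins, not real data
    generated = "AACGTTTTAACGTTTA"
    print(pearson(kmer_rates(real, 2), kmer_rates(generated, 2)))
```

A value near 1 would indicate that the generated sample uses each combination at close to the same relative rate as the real DNA; rates are used instead of raw counts so that samples of different lengths can be compared.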
So our A.I. learned the patterns present in DNA. Patterns in DNA correspond to things produced by that DNA. So the A.I. has learned at least a little bit about the functions of life which that DNA encodes.
Most of the code and all of the data will be posted on my GitHub; follow me there for code updates. Or follow me on Twitter at @GDurendal.