Neural Cryptanalysis – An Introduction to Attacking Cryptography with Deep Learning

Roger A. Hallman 

1 Introduction 

Machine learning and cryptography are two technologies woven into the fabric of modern life; in fact, we interact almost unknowingly with both of them throughout our daily routines. Given this ubiquity, the average person who hasn't taken a deep dive into computer science and mathematics might be surprised to learn that these fields have largely evolved in isolation from one another. Cryptographers and machine learning researchers are both members of your local university's computer science department, and while they know each other, they have been unlikely to collaborate. The disciplines have separate journals and conferences (e.g., the Journal of Cryptology and EUROCRYPT for cryptographers; ICML and ICLR for machine learning researchers). In spite of surface-level differences between these fields of mathematics and computer science, the two have interesting similarities and relationships (noted by no less than Ron Rivest [18]).

In fact, cryptography and machine learning are increasingly coming together thanks to recent advances in encrypted computation. Refereed literature detailing the results of research on encrypted learning is becoming much more commonplace, appearing in top-tier conferences and journals. Moreover, there are now open source software libraries available that support encrypted learning, and there are even a few for-profit businesses offering privacy-preserving machine learning services. However, the explosive growth of interest in encrypted learning is focused primarily on protecting private data, and is worlds away from another intuitive intersection of machine learning and cryptography: cryptanalysis, the science of breaking cryptographic systems.

Cryptography is used to make ordered data resemble random noise, while machine learning excels at discerning patterns and order in seemingly random data. There is a small-but-growing body of literature delving into this topic, and this article serves to introduce the reader to the field of neural cryptanalysis. The remainder of the article is organized as follows: background information is provided in Section 2, neural cryptanalysis is defined in Section 3 with selected results presented in Section 4, and concluding remarks are given in Section 5.

2 Background 

Cryptography, cryptanalysis, and machine and deep learning are complicated topics to which I cannot possibly do justice in a single article section. Readers wishing to take a deeper dive into these topics (that is, to the depth of an advanced undergraduate or early graduate student) are encouraged to read through the following recommended sources (from which this background section is drawn):

• There are quite a few textbooks on cryptography; the most widely used in academia is probably Katz and Lindell's Introduction to Modern Cryptography [10].
• There are far fewer textbooks available on cryptanalysis; however, Stamp and Low's Applied Cryptanalysis [20] is the best resource that I am aware of.
• As with cryptography, there are a multitude of excellent textbooks on statistical machine learning; I might direct the reader to Mohri, et al.'s Foundations of Machine Learning [16]. Quality texts on the subset of machine learning called deep learning are becoming more commonplace, and my favorite introductory work is Aggarwal's Neural Networks and Deep Learning [1]. For the reader who is particularly courageous in the face of daunting mathematics, I would add Roberts, et al.'s Principles of Deep Learning Theory [19].

2.1 Cryptography & Cryptanalysis 

We will begin our background section by covering cryptography and cryptanalysis. Cryptography is the study of mathematical techniques that can secure digital information, systems, and distributed computations against adversarial attacks. Cryptography has been practiced since ancient times; indeed, the first cryptographic cipher that students are exposed to is usually the "Caesar Cipher." The Caesar Cipher is representative of a class of ciphers known as shift (or substitution) ciphers, where letters are shifted forwards or backwards according to some predetermined value referred to as a key. That key value k is important: a maxim among cryptographers known as Kerckhoffs' Principle holds that a cipher's security should rely solely on the secrecy of k, not on the secrecy of the algorithm itself. An encryption algorithm ENC_k(m) = c takes a plaintext message m and produces a ciphertext c, while a corresponding decryption algorithm DEC_k(c) = m takes the ciphertext c and recovers the original message m. Symmetric cryptographic ciphers use the same key value k for both encryption and decryption, while asymmetric ciphers use a public key k_pub for encryption and a secret key k_sec for decryption.
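
To make the notation concrete, here is a minimal Python sketch of a shift cipher; the alphabet handling and the key value are illustrative choices:

```python
# A minimal shift (Caesar) cipher, illustrating the roles of ENC_k and
# DEC_k. The 26-letter alphabet and the key value are illustrative.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def enc(m: str, k: int) -> str:
    """ENC_k(m) = c: shift each letter forward by k positions (mod 26)."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + k) % 26] if ch in ALPHABET else ch
        for ch in m.upper()
    )

def dec(c: str, k: int) -> str:
    """DEC_k(c) = m: decryption is encryption with the inverse shift."""
    return enc(c, -k)

k = 3                                   # the secret key value
c = enc("ATTACK AT DAWN", k)
print(c)                                # DWWDFN DW GDZQ
assert dec(c, k) == "ATTACK AT DAWN"    # DEC_k(ENC_k(m)) = m
```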

Different ciphers are used in different scenarios. For instance, data-at-rest will typically be encrypted via a symmetric scheme, while data-in-transit will be encrypted with an asymmetric scheme. Encrypted computation is a more recent field that enables operations directly on encrypted data-in-use; the main cryptographic primitives that make this possible are multiparty computation [6] and homomorphic encryption [5].

Cryptographic ciphers are used to keep information secret, so it makes sense that there are attacks on these systems. Cryptanalysis is the formal study of attacking and defeating cryptographic ciphers, thus revealing their weaknesses. Successfully (or unsuccessfully) attacking ciphers gives cryptographers important insights for strengthening the systems they design. Common attacks include brute force, frequency and statistical analysis, side channel analysis, as well as the discovery and exploitation of vulnerabilities in software or engineered systems. (We will not be dealing with the latter two methods in this article, as they fall within the purview of other disciplines, e.g., Reverse Engineering and Cybersecurity Testing.)

Different cryptanalysis attacks exist to test ciphers. Symmetric ciphers are often attacked by a method known as differential cryptanalysis, where differences are observed in how the plaintext propagates through the cipher to produce different ciphertexts. Once enough (plaintext, ciphertext) pairs have been observed, a cryptanalyst can deduce information about the cipher, possibly even recovering the key value k. Asymmetric ciphers rely on hardness assumptions for certain mathematical problems (e.g., factoring large numbers or computing discrete logarithms over finite fields). In RSA, for example, the key values k_pub, k_sec are functions of a composite number N = p × q where p and q are prime numbers. This kind of integer factorization is believed to be hard and cannot currently be performed efficiently; however, it is expected that the ascendance of quantum computing will bring algorithms that can solve these hard problems in polynomial time (a measure of computational complexity). There is a race to develop cryptographic schemes that will remain resilient against quantum attacks; the National Institute of Standards and Technology (NIST)¹ is one of many organizations driving the development of such schemes. Post-quantum cryptanalysis is an active and evolving field of research.
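
To make the intuition behind differential cryptanalysis concrete, the toy sketch below builds the difference distribution table of a 4-bit S-box (an arbitrary illustrative permutation, not one from a deployed cipher) and shows that certain output differences occur far more often than chance; such biases are exactly what a differential cryptanalyst exploits:

```python
# A toy illustration of the core idea behind differential cryptanalysis:
# for a nonlinear S-box, some input XOR-differences map to some output
# XOR-differences far more often than chance. The 4-bit S-box below is
# an arbitrary illustrative permutation, not one from a real cipher.
from collections import Counter

SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
        0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]

def ddt(sbox):
    """Difference distribution table: counts of each output difference
    for every input difference over all 16 inputs."""
    table = {d_in: Counter() for d_in in range(16)}
    for x in range(16):
        for d_in in range(16):
            d_out = sbox[x] ^ sbox[x ^ d_in]
            table[d_in][d_out] += 1
    return table

table = ddt(SBOX)
# An "ideal" S-box would spread each row uniformly (about 1 per cell);
# large counts are exploitable biases.
for d_in in (0x4, 0xB):
    d_out, count = table[d_in].most_common(1)[0]
    print(f"input diff {d_in:#x} -> output diff {d_out:#x} in {count}/16 cases")
```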

2.2 Machine Learning 

Machine learning (ML) is a subfield of artificial intelligence which seeks to enable computer programs to improve task performance with experience (i.e., exposure to data). Classical ML algorithms rely heavily on statistical methods, which makes their results quite explainable. Deep learning (DL) is a subset of ML which uses model architectures inspired by human neurological systems. DL systems are often opaque in their operation and consequently reach conclusions by processes which are not easily explained or understood, even by knowledgeable computer scientists. This article will focus on DL and neural networks, rather than classical ML.

Simple feed-forward neural networks, where each neuron in a layer is connected to every neuron in the next layer, are known to be universal approximators [7, 13]. Convolutional neural networks (CNNs) apply convolution operations to data with grid-like structure to learn patterns and extract features; these networks are often used in image classification tasks. Recurrent neural networks (RNNs) maintain hidden state to capture temporal dependencies, and consequently excel at sequential data processing tasks (e.g., time-series analysis and natural language processing). Finally, attention mechanisms and large language models (LLMs) have been taking the world by storm. These state-of-the-art architectures allow for improved flexibility and more context-aware information extraction, and thus are key to most recent advancements in artificial intelligence.
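
To ground the universal approximation claim, here is a minimal NumPy sketch (all hyperparameters are illustrative) of a feed-forward network with a single hidden layer learning XOR, a function that no purely linear model can represent:

```python
# A minimal fully connected feed-forward network learning XOR via plain
# gradient descent. Width, learning rate, and step count are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)     # hidden layer, width 8
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)     # output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    h = np.tanh(X @ W1 + b1)                      # hidden activations
    p = sigmoid(h @ W2 + b2)                      # predicted probabilities
    dp = (p - y) / len(X)                         # grad of cross-entropy loss
    dW2, db2 = h.T @ dp, dp.sum(0)
    dh = (dp @ W2.T) * (1 - h ** 2)               # backprop through tanh
    dW1, db1 = X.T @ dh, dh.sum(0)
    W1 -= 0.5 * dW1; b1 -= 0.5 * db1
    W2 -= 0.5 * dW2; b2 -= 0.5 * db2

h = np.tanh(X @ W1 + b1)
print(sigmoid(h @ W2 + b2).round(2))              # approaches [[0],[1],[1],[0]]
```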

3 Defining Neural Cryptanalysis 

We are now in a position to discuss the use of DL for cryptanalysis tasks. Neural cryptanalysis is an approach that leverages DL to measure cryptographic cipher strength. Formally, a neural cryptanalysis system [24] is a tuple (M1, M2, N, S) where M1 and M2 are mutually exclusive, finite sets of plaintext-ciphertext pairs (p, c) that satisfy ENC_k(p) = c, with p ∈ Z_2^m and c ∈ Z_2^n. Here Z_2 denotes the binary value space {0, 1} and Z_2^t represents a binary string of length t. N is a set of neural networks trained on M1 and tested on M2, and S is a measure of cipher strength generated by training and testing the neural networks in N.
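
As a concrete (and deliberately trivial) instance of this formulation, the sketch below builds M1 and M2 from a toy 8-bit cipher, trains an off-the-shelf scikit-learn network as N, and reports a bitwise match rate as the strength score S. The toy cipher and all parameters are illustrative choices, not Xiao, et al.'s setup:

```python
# A minimal sketch of the (M1, M2, N, S) formulation: disjoint train/test
# sets of (plaintext, ciphertext) pairs under a fixed key, a network that
# tries to mimic the cipher, and a strength score S. The toy 8-bit
# "cipher" (XOR with a key, then a fixed bit permutation) is deliberately
# weak and purely illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
KEY = rng.integers(0, 2, 8)
PERM = rng.permutation(8)

def enc(p):                       # ENC_k(p): XOR with key, then permute bits
    return (p ^ KEY)[PERM]

# Enumerate all 256 8-bit plaintexts and split them, so M1 and M2 are
# mutually exclusive by construction.
all_p = np.array([[(i >> j) & 1 for j in range(8)] for i in range(256)])
rng.shuffle(all_p)
M1_p, M2_p = all_p[:200], all_p[200:]
M1_c = np.array([enc(p) for p in M1_p])
M2_c = np.array([enc(p) for p in M2_p])

net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
net.fit(M1_p, M1_c)               # N: trained on M1 (multilabel bit outputs)

# S: bitwise match rate on the held-out set M2. A score near 0.5 means
# the cipher looks random to the network; near 1.0 means it is mimicked.
S = (net.predict(M2_p) == M2_c).mean()
print(f"strength score S = {S:.3f}")
```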

We can further divide neural cryptanalysis into subclassifications, based on how much of a role DL plays in the attack. Indirect neural cryptanalysis refers to attacks where DL capabilities serve in more of a supporting role. Cryptanalysis attacks where DL models are trained to directly recover cryptographic secrets are referred to as direct neural cryptanalysis. Finally, large language model cryptanalysis refers to attacks that utilize LLMs to attack a hard mathematical problem which underpins some current post-quantum cryptography schemes. Examples of each of these subclassifications will be given in Section 4, and GitHub repositories will be linked where public code bases are available. Neural cryptanalysis has been successfully used in attacks on classical and symmetric ciphers; however, there are well-known results [11] which demonstrate that neural cryptanalysis is not an efficient attack against common asymmetric cryptography schemes.

4 Selected Results in Neural Cryptanalysis Literature 

Having provided sufficient background and a formal introduction to neural cryptanalysis, we are now ready to dive into some results. We will first acquaint ourselves with a few proof-of-concept neural cryptanalysis attacks against classical ciphers that aren't mathematically sophisticated (at least in comparison to modern ciphers). After these introductory case studies, we will turn our attention to attacks against modern cryptographic ciphers. First, we will cover two attacks against block ciphers and an attack against a stream cipher. We will then study LLM-based cryptanalysis against the Learning With Errors (LWE) Problem, on which several post-quantum cryptography schemes are built.

4.1 Classical Cipher Neural Cryptanalysis 

Classical ciphers, whether shift or substitution ciphers, deterministically transform a plaintext according to some key value. A shift cipher might uniformly add 3 (mod 26), so that the plaintext CAT LABS becomes the ciphertext FDW ODEV. A Vigenère cipher instead applies a varying shift, cycling through the letters of a keyword as it moves through the plaintext. Given the key value "CATLabs", the plaintext Neural Cryptanalysis maps to the ciphertext Pencam Utyieaosnylts.
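
A short Python sketch reproduces both examples; note that the key index advances only on alphabetic characters, and case follows the plaintext:

```python
# A sketch of the Vigenère cipher; a shift cipher is the special case of
# a one-letter key (e.g., key "D" for a shift of 3).
def vigenere(plaintext: str, key: str) -> str:
    out, i = [], 0
    for ch in plaintext:
        if ch.isalpha():
            shift = ord(key[i % len(key)].upper()) - ord("A")
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
            i += 1                       # key advances on letters only
        else:
            out.append(ch)               # spaces pass through unchanged
    return "".join(out)

print(vigenere("CAT LABS", "D"))                    # -> FDW ODEV
print(vigenere("Neural Cryptanalysis", "CATLabs"))  # -> Pencam Utyieaosnylts
```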

In some sense, we can think of neural cryptanalysis against these classical ciphers as a language translation problem. Gomez, et al.² [9] used this approach to determine whether a neural network could be trained to deduce withheld ciphers from unaligned text, without supplementation of preexisting human knowledge. The authors used a generative deep learning approach [21], specifically modifying the CycleGAN [25] architecture to make it stable over discrete data sets, a technique dubbed "reliable fooling via relaxation." Using this approach, they were able to successfully decrypt shift and Vigenère ciphers with about 75% accuracy. Another interesting result comes out of machine translation research, where Aldarrab and May [3] used a transformer-based [22] Seq2Seq model combined with a "frequency encoder" to decipher 700 historical substitution ciphers in 14 languages.

4.2 Attacking Block Ciphers with Neural Distinguishers 

Block ciphers use pseudorandom permutations to encrypt fixed-size blocks of plaintext and are among the most commonly used cryptographic technologies today. These symmetric ciphers provide fast and computationally efficient performance, and are often used for protecting data at rest; however, key exchange protocols can make block ciphers a feasible option for protecting data in transit as well.

Gohr [8] used CNNs augmented with a Bayesian optimizer to construct a neural distinguisher³, a classifier that can distinguish real ciphertexts from randomly generated data. Gohr ran this attack against the Speck32/64 cipher, a lightweight block cipher with 32-bit blocks and a 64-bit key. The Speck family of ciphers was designed for use in Internet of Things devices and supports various block and key sizes. Even in a scenario where perfect knowledge of the differential distribution would not allow an attacker to perform better than random guessing, Gohr was able to derive sets of candidate keys after a 9-round distinguisher attack; only two additional rounds of distinguisher attacks were necessary to recover keys. These neural distinguisher attacks successfully recovered the key value at about three times the rate of classical distinguisher attacks.
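
Gohr's implementation is linked in footnote 3; the sketch below reproduces only the data-generation step of such an attack, building labeled samples from round-reduced Speck32/64 with the input difference (0x0040, 0x0) used in that work. The key-word ordering here follows one common convention, and training the CNN itself is omitted:

```python
# Data generation for a neural distinguisher on round-reduced Speck32/64:
# label-1 samples are ciphertext pairs whose plaintexts differ by a fixed
# input difference; label-0 samples are random bits. Any classifier that
# beats 50% accuracy on held-out data is a distinguisher.
import numpy as np

MASK = 0xFFFF                                     # Speck32 uses 16-bit words

def rol(x, r): return ((x << r) | (x >> (16 - r))) & MASK
def ror(x, r): return ((x >> r) | (x << (16 - r))) & MASK

def round_fn(x, y, k):
    """One Speck round: rotate, modular add, XOR key; rotate, XOR."""
    x = ((ror(x, 7) + y) & MASK) ^ k
    y = rol(y, 2) ^ x
    return x, y

def expand_key(k, rounds):
    """Speck32/64 key schedule from four 16-bit key words."""
    l, ks = [k[1], k[2], k[3]], [k[0]]
    for i in range(rounds - 1):
        l_new, k_new = round_fn(l[i], ks[i], i)   # round index acts as "key"
        l.append(l_new); ks.append(k_new)
    return ks

def encrypt(x, y, ks):
    for k in ks:
        x, y = round_fn(x, y, k)
    return x, y

def to_bits(words):
    """Flatten 16-bit words into a 0/1 feature vector, MSB first."""
    return np.array([(int(w) >> i) & 1 for w in words for i in range(15, -1, -1)])

def make_samples(n, rounds=5, diff=(0x0040, 0x0)):
    rng = np.random.default_rng(0)
    X, Y = [], []
    for _ in range(n):
        label = int(rng.integers(0, 2))
        if label:                                 # real pair, fixed difference
            ks = expand_key(list(rng.integers(0, 1 << 16, 4)), rounds)
            px, py = int(rng.integers(0, 1 << 16)), int(rng.integers(0, 1 << 16))
            c0 = encrypt(px, py, ks)
            c1 = encrypt(px ^ diff[0], py ^ diff[1], ks)
            x = to_bits(c0 + c1)                  # 64 features per sample
        else:                                     # random data
            x = rng.integers(0, 2, 64)
        X.append(x); Y.append(label)
    return np.array(X), np.array(Y)

X, Y = make_samples(10000)                        # train any classifier on (X, Y)
```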

4.3 Neural Cryptanalysis Against Stream Ciphers 

In addition to the block ciphers of Section 4.2, there is another class of symmetric cipher known as the stream cipher. Stream ciphers encrypt one bit (or byte) at a time, producing a continuous stream of ciphertext. These ciphers are lightweight and efficient for wireless communications, streaming media, and real-time encryption tasks.

Xiao, et al. [24] developed the neural cryptanalysis system described in Section 3 and attempted to mimic the output of the Hitag2 cipher. Hitag2 is a proprietary cipher, which means that its inner workings are somewhat opaque; however, it is understood to implement a 48-bit Linear Feedback Shift Register (LFSR) together with nonlinear Boolean functions. The cipher was used in automotive scenarios, particularly in keyless entry mechanisms. The authors used a fully-connected, feed-forward network with four hidden layers of bounded width to mimic the Hitag2 cipher with approximately 67% success. (They also used a fully-connected, feed-forward network with a single hidden layer to mimic a reduced-round Data Encryption Standard block cipher with 99.7% accuracy.)
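
Since Hitag2's design is proprietary, the sketch below shows only the general construction that this description suggests: a 48-bit LFSR filtered through a nonlinear Boolean function. The feedback taps and the filter are arbitrary illustrative choices, not the real Hitag2 parameters; mimicking such a cipher then becomes a supervised learning problem over (initial state, keystream) examples:

```python
# A toy keystream generator in the style attributed to Hitag2: a 48-bit
# LFSR filtered through a nonlinear Boolean function. The taps and the
# filter are arbitrary illustrative choices, NOT the proprietary design.
def keystream(state: int, n_bits: int, taps=(0, 5, 23, 47)):
    bit = lambda i: (state >> i) & 1
    out = []
    for _ in range(n_bits):
        # nonlinear output filter: combine a few state bits non-linearly
        out.append(bit(1) ^ (bit(7) & bit(19)) ^ (bit(30) & bit(41) & bit(46)))
        # LFSR update: XOR the tap bits and shift the feedback in at the top
        fb = 0
        for t in taps:
            fb ^= bit(t)
        state = (state >> 1) | (fb << 47)
    return out

print("".join(str(b) for b in keystream(0x123456789ABC, 16)))
```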

4.4 Large Language Model-based Neural Cryptanalysis 

LLMs are proving to be versatile architectures which excel in a multitude of diverse tasks. We will now explore a fascinating application of LLMs against a hard mathematical problem upon which several quantum-resistant cryptography schemes are built. 

4.4.1 The Learning With Errors Problem 

The Learning With Errors (LWE) Problem [17] seeks to recover a secret value s ∈ Z_q^n given a sequence of 'approximate' random linear equations on s. One observes samples (a, b) where a is a random vector of n integers mod q and b is a noisy modular inner product of a and s, i.e., ⟨a, s⟩ + e = b, where e is an error value drawn from some distribution. LWE and its variants underpin all of the current Fully Homomorphic Encryption (FHE) schemes, as well as the CRYSTALS-Kyber and CRYSTALS-Dilithium post-quantum cryptography schemes that NIST is currently standardizing. There is no known polynomial-time classical or quantum attack against LWE; moreover, LWE enjoys a worst-case to average-case reduction, meaning it is as computationally hard in the average case as in the worst case. As an illustrative example, consider the following [17]:

14s1 + 15s2 + 5s3 + 2s4 ≈ 8 (mod 17)
13s1 + 14s2 + 14s3 + 6s4 ≈ 16 (mod 17)
6s1 + 10s2 + 13s3 + 1s4 ≈ 3 (mod 17)
10s1 + 4s2 + 12s3 + 16s4 ≈ 12 (mod 17)
9s1 + 5s2 + 9s3 + 6s4 ≈ 9 (mod 17)
3s1 + 6s2 + 4s3 + 5s4 ≈ 16 (mod 17)
⋮
6s1 + 7s2 + 16s3 + 2s4 ≈ 3 (mod 17)
The answer in this case is s = (0, 13, 9, 11). In essence, the LWE problem hides a secret value in noisy inner products, and solving it can be thought of as a modular least-squares regression.
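
A short sketch can check this example numerically, recovering the small per-equation errors, and then generate a fresh toy sample. (The error here is drawn from {−1, 0, 1} for simplicity; real schemes typically use a discrete Gaussian distribution.)

```python
# Checking the worked LWE example: with s = (0, 13, 9, 11), every
# equation holds modulo 17 up to a small error e.
import numpy as np

q, s = 17, np.array([0, 13, 9, 11])
A = np.array([[14, 15,  5,  2],
              [13, 14, 14,  6],
              [ 6, 10, 13,  1],
              [10,  4, 12, 16],
              [ 9,  5,  9,  6],
              [ 3,  6,  4,  5],
              [ 6,  7, 16,  2]])
b = np.array([8, 16, 3, 12, 9, 16, 3])

e = (b - A @ s) % q        # per-equation errors: 1, 16 (= -1), 0, 16, 1, 0, 1
print(e)

# Generating a fresh toy sample (a, <a, s> + e mod q) with small noise.
rng = np.random.default_rng(0)
a = rng.integers(0, q, size=4)
noise = int(rng.integers(-1, 2))      # e drawn from {-1, 0, 1} here
print(a, (a @ s + noise) % q)
```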

The LWE problem as stated above is sometimes referred to as Search LWE. There are a couple of additional LWE variants, including:

• Ring LWE [14], where LWE is instantiated over polynomial rings over finite fields;
• Decision LWE [15], where the task is to distinguish the LWE distribution from the uniform distribution {(a, b) : a, b are chosen uniformly at random}.

The secret vector s can be sampled from many distributions; however, many schemes (FHE schemes in particular) use binary distributions (i.e., sampling from {0, 1}^n) or ternary distributions (i.e., sampling from {−1, 0, 1}^n). It is also common to see sparse secrets with Hamming weight h.
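
As a small illustration, here is one way to sample a sparse ternary secret of Hamming weight h (the parameter values are illustrative):

```python
# Sampling a sparse ternary secret: dimension n with exactly h nonzero
# entries drawn from {-1, 1}. Parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, h = 256, 16                                   # dimension, Hamming weight
s = np.zeros(n, dtype=int)
support = rng.choice(n, size=h, replace=False)   # h secret nonzero positions
s[support] = rng.choice([-1, 1], size=h)         # ternary values
assert np.count_nonzero(s) == h                  # sparse, as required
```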

4.4.2 Attacking LWE with LLMs 

The factor that makes attacking LWE with machine learning particularly hard is the use of modular arithmetic. Without this feature, recovering s would be trivial, as machine learning models are robust to noise in training data. Modular arithmetic is known to be a hard problem for machine learning, and transformers initially struggled to compute even simple arithmetic operations. However, Charton [4] recently demonstrated that transformers can compute matrix-vector products, the primary mathematical operation in LWE.

Wenger, et al. [23] utilized this advance to train a transformer to compute modular inner products. They began with a one-dimensional case with the goal of predicting b = a · s mod q with at least 95% accuracy, where s is unknown and a · s ∈ Z_q. They achieved this goal, though the required number of training samples was between 10q and 50q, and the efficacy of the training process was dependent on q. Having achieved this base result, they moved on to experiments with multi-dimensional cases: predicting b = ⟨a, s⟩ mod q where a ∈ Z_q^n and s ∈ Z_q^n or s ∈ {0, 1}^n (i.e., a binary secret). These experiments were able to achieve 90% accuracy with n = 2 and relatively small modulus values of q ≤ 967, but accuracy fell to approximately 30% when the modulus was raised to q = 1471. Simply stated, learning the modular inner product between a ∈ Z_q^n and s ∈ Z_q^n is a significantly more difficult task.

This attack, dubbed Secret Recovery Attacks on LWE via Seq2Seq Models (SALSA)⁴, demonstrates the potential of this approach for testing the strength of lattice-based, post-quantum cryptography schemes. Assuming access to a sufficient number of LWE pairs in n dimensions, all using the same secret value s, the attack has three distinct components:

1. a transformer model M that is trained to predict b from a;
2. a secret recovery algorithm which feeds special values of a into M and uses b̃ = M(a) to predict the secret value s̃;
3. a secret verification procedure which checks that the residual value r = b − ⟨a, s̃⟩ mod q has a small standard deviation σ.

The model M predicts b from a by minimizing cross-entropy between a predicted value b̃ and the ground-truth value b. There are two secret recovery algorithms, so the attack can either utilize (i) a direct secret recovery that is akin to a chosen plaintext attack, or (ii) a distinguisher secret recovery which is based on the Decision LWE problem. Finally, the attack must verify whether the secret prediction s̃ = s. When s̃ = s, the residual value is equal to the noise: r = b − ⟨a, s̃⟩ = b − ⟨a, s⟩ = e mod q, with a small σ. When s̃ ≠ s, r follows an approximately uniform distribution over Z_q.
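
The verification step is easy to reproduce at toy scale. In the sketch below (all parameters are illustrative), residuals under the true secret concentrate with σ matching the noise, while residuals under a wrong guess spread roughly uniformly over Z_q:

```python
# Secret verification at toy scale: residuals under the true secret have
# small standard deviation (that of the noise e), while residuals under
# a wrong guess are near-uniform over Z_q. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, q, m = 32, 251, 5000
s = rng.integers(0, 2, n)                   # a binary secret
A = rng.integers(0, q, size=(m, n))
e = rng.integers(-2, 3, m)                  # small illustrative noise
b = (A @ s + e) % q

def residual_std(s_guess):
    r = (b - A @ s_guess) % q
    r = np.where(r > q // 2, r - q, r)      # center residuals around zero
    return r.std()

print(residual_std(s))                      # ~ std of e (about 1.4)
print(residual_std(rng.integers(0, 2, n)))  # ~ q / sqrt(12) (about 72):
                                            # near-uniform, so s-tilde != s
```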

The SALSA attack was able to recover the secret s up to dimension n = 128, an impressive result for a proof-of-concept attack. However, it is important to understand that lattice-based cryptographic schemes use much larger dimensions (for example, the FHE Security Standard [2] recommends dimension n ≥ 1024 and gives parameter recommendations up to n = 32768). Follow-on work⁵ expanded on these results [12], successfully recovering s up to dimension n = 512, along with a number of other improvements.

5 Conclusion 

This article has given the reader an introduction to the topic of neural cryptanalysis: attacking and testing cryptographic ciphers with deep learning. We have covered attacks against classical ciphers, as well as more current symmetric ciphers and the Learning With Errors problem, a hard lattice problem that is important in the context of secure computation and post-quantum cryptography. Moreover, Git repositories are linked where the referenced authors have made their research publicly available, so the reader is able to explore the advancements in this field as deeply as they like.

Nevertheless, this line of work raises numerous questions. One is tempted to ask whether we are on the cusp of a viable deep learning-based attack against already deployed cryptographic systems. Are certain cryptographic schemes more vulnerable to these types of neural cryptanalysis attacks? Are cryptographic use cases (e.g. cryptocurrencies) safe from neural cryptanalysis attacks? Can these capabilities be weaponized? 

The good news is that publicly available information suggests deep learning is not yet a threat to currently deployed cryptographic systems. (However, deep learning-based side channel attacks, not covered in this article, may well circumvent cryptographic protections and infer sensitive information.) Moreover, the attacks described in this article were against symmetric, rather than asymmetric, ciphers. Kearns and Valiant [11] demonstrated that efficiently learning certain Boolean formulae would be as hard as breaking ciphers such as RSA, suggesting that learning cryptographic operations is infeasible and that asymmetric cryptography is safe for the foreseeable future (including the cryptocurrency ecosystem). Finally, the cryptographic ciphers in real-world service are sophisticated systems which often have additional protections beyond the raw cipher. In other words, deep learning is not currently a threat to cryptography, but that may change.

¹ https://csrc.nist.gov/projects/post-quantum-cryptography
² https://github.com/for-ai/CipherGAN
³ https://github.com/agohr/deep_speck
⁴ https://github.com/facebookresearch/SALSA
⁵ https://github.com/facebookresearch/verde

References 

[1] Aggarwal, C. Neural networks and deep learning: a textbook. Springer, 2018. 

[2] Albrecht, M., Chase, M., Chen, H., Ding, J., Goldwasser, S., Gorbunov, S., Halevi, S., Hoffstein, J., Laine, K., Lauter, K., et al. Homomorphic encryption standard. Protecting Privacy through Homomorphic Encryption (2021), 31–62.

[3] Aldarrab, N., and May, J. Can sequence-to-sequence models crack substitution ciphers? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (2021), pp. 7226–7235. 

[4] Charton, F. Linear algebra with transformers. Transactions on Machine Learning Research (2022). 

[5] Cheon, J. H., Costache, A., Moreno, R. C., Dai, W., Gama, N., Georgieva, M., Halevi, S., Kim, M., Kim, S., Laine, K., et al. Introduction to homomorphic encryption and schemes. Protecting Privacy through Homomorphic Encryption (2021), 3–28. 

[6] Cramer, R., Damgård, I. B., et al. Secure multiparty computation and secret sharing. Cambridge University Press, 2015.

[7] Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 2, 4 (1989), 303–314. 

[8] Gohr, A. Improving attacks on round-reduced speck32/64 using deep learning. In Annual International Cryptology Conference (2019), Springer, pp. 150–179. 

[9] Gomez, A. N., Huang, S., Zhang, I., Li, B. M., Osama, M., and Kaiser, L. Unsupervised cipher cracking using discrete GANs. In International Conference on Learning Representations (2018). 

[10] Katz, J., and Lindell, Y. Introduction to modern cryptography. CRC press, 2020. 

[11] Kearns, M., and Valiant, L. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM) 41, 1 (1994), 67–95. 

[12] Li, C., Sotakova, J., Wenger, E., Allen-Zhu, Z., Charton, F., and Lauter, K. Salsa verde: a machine learning attack on learning with errors with sparse small secrets. arXiv preprint arXiv:2306.11641 (2023). 

[13] Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width. Advances in neural information processing systems 30 (2017). 

[14] Lyubashevsky, V., Peikert, C., and Regev, O. A toolkit for ring-lwe cryptography. In Advances in Cryptology–EUROCRYPT 2013: 32nd Annual International Conference on the Theory and Applications of Cryptographic Techniques, Athens, Greece, May 26-30, 2013. Proceedings 32 (2013), Springer, pp. 35–54. 

[15] Micciancio, D., and Mol, P. Pseudorandom knapsacks and the sample complexity of lwe search-to-decision reductions. In Annual cryptology conference (2011), Springer, pp. 465–484. 

[16] Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of machine learning. MIT press, 2018. 

[17] Regev, O. The learning with errors problem. Invited survey in CCC 7, 30 (2010), 11. 

[18] Rivest, R. L. Cryptography and machine learning. In International Conference on the Theory and Application of Cryptology (1991), Springer, pp. 427–439. 

[19] Roberts, D. A., Yaida, S., and Hanin, B. The principles of deep learning theory. Cambridge University Press, 2022.

[20] Stamp, M., and Low, R. M. Applied cryptanalysis: breaking ciphers in the real world. John Wiley & Sons, 2007. 

[21] Tomczak, J. M. Deep Generative Modeling. Springer Cham, 2021. 

[22] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems 30 (2017). 

[23] Wenger, E., Chen, M., Charton, F., and Lauter, K. E. Salsa: Attacking lattice cryptography with transformers. Advances in Neural Information Processing Systems 35 (2022), 34981–34994. 

[24] Xiao, Y., Hao, Q., and Yao, D. D. Neural cryptanalysis: metrics, methodology, and applications in cps ciphers. In 2019 IEEE conference on dependable and secure computing (DSC) (2019), IEEE, pp. 1–8. 

[25] Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (2017), pp. 2223–2232.