Publications and Talks

Also available on DBLP. If you encounter any difficulty locating a PDF, please contact me.

Conference Papers

2025
A (1+ε)-approximation for ultrametric embedding in subquadratic time with Gabriel Bathie · AAAI paper arXiv code

abstract

Efficiently computing accurate representations of high-dimensional data is essential for data analysis and unsupervised learning. Dendrograms, also known as ultrametrics, are widely used representations that preserve hierarchical relationships within the data. However, popular methods for computing them, such as algorithms, suffer from quadratic time and space complexity, making them impractical for large datasets. The best ultrametric embedding (a.k.a. ‘’best ultrametric fit’’) problem, which aims to find the ultrametric that best preserves the distances between points in the original data, is known to require at least quadratic time for an exact solution. Recent work has focused on improving scalability by approximating optimal solutions in subquadratic time, resulting in a \((\sqrt{2} + \epsilon)\)-approximation (Cohen-Addad, de Joannis de Verclos and Lagarde, 2021).
In this paper, we present the first subquadratic algorithm that achieves arbitrarily precise approximations of the optimal ultrametric embedding. Specifically, we provide an algorithm that, for any \(c \geq 1\), outputs a \(c\)-approximation of the best ultrametric in time \(\tilde{O}(n^{1 + 1/c})\). In particular, for any fixed \(\epsilon > 0\), the algorithm computes a \((1+\epsilon)\)-approximation in time \(\tilde{O}(n^{2 - \epsilon + o(\epsilon^2)})\).
Experimental results show that our algorithm improves upon previous methods in terms of approximation quality while maintaining comparable running times.
2025
Eco Search: A No-delay Best-First Search Algorithm for Program Synthesis with Théo Matricon, Nathanaël Fijalkow · AAAI · Invited for oral presentation paper

abstract

Many approaches to program synthesis perform a combinatorial search within a large space of programs to find one that satisfies a given specification. To tame the search space blowup, previous works introduced probabilistic and neural approaches to guide this combinatorial search by inducing heuristic cost functions. Best-first search algorithms ensure to search in the exact order induced by the cost function, significantly reducing the portion of the program space to be explored. We present a new best-first search algorithm called EcoSearch, which is the first constant-delay algorithm for pre-generation cost function: the amount of compute required between outputting two programs is constant, and in particular does not increase over time. This key property yields important speedups: we observe that eco outperforms its predecessors on two classic domains.
2022
Scaling Neural Program Synthesis with Distribution-Based Search with Nathanaël Fijalkow, Théo Matricon, Kevin Ellis, Pierre Ohlmann, Akarsh Nayan Potta · AAAI paper

abstract

We consider the problem of automatically constructing computer programs from input-output examples. We investigate how to augment probabilistic and neural program synthesis methods with new search algorithms, proposing a framework called distribution-based search. Within this framework, we introduce two new search algorithms: Heap Search, an enumerative method, and SQRT Sampling, a probabilistic method. We prove certain optimality guarantees for both methods, show how they integrate with probabilistic and neural techniques, and demonstrate how they can operate at scale across parallel compute environments. Collectively these findings offer theoretical and applied studies of search algorithms for program synthesis that integrate with recent developments in machine-learned program synthesizers.
2021
Improving Ultrametrics Embeddings Through Coresets with Vincent Cohen-Addad, Rémi de Joannis de Verclos · ICML paper

abstract

To tackle the curse of dimensionality in data analysis and unsupervised learning, it is critical to be able to efficiently compute “simple” faithful representations of the data that helps extract information, improves understanding and visualization of the structure. When the dataset consists of \(d\)-dimensional vectors, simple representations of the data may consist in trees or ultrametrics, and the goal is to best preserve the distances (i.e.: dissimilarity values) between data elements. To circumvent the quadratic running times of the most popular methods for fitting ultrametrics, such as average, single, or complete linkage, recently presented a new algorithm that for any \(c \ge 1\), outputs in time \(n^{1+O(1/c^2)}\) an ultrametric \(\Delta\) such that for any two points \(u, v\), \(\Delta(u, v)\) is within a multiplicative factor of \(5c\) to the distance between \(u\) and \(v\) in the “best” ultrametric representation. We improve the above result and show how to improve the above guarantee from \(5c\) to \(\sqrt{2}c + \varepsilon\) while achieving the same asymptotic running time. To complement the improved theoretical bound, we additionally show that the performances of our algorithm are significantly better for various real-world datasets.
2021
The complexity of learning linear temporal formulas from examples with Nathanaël Fijalkow · ICGI paper

abstract

In this paper we initiate the study of the computational complexity of learning linear temporal logic (LTL) formulas from examples. We construct approximation algorithms for fragments of LTL and prove hardness results; in particular we obtain tight bounds for approximation of the fragment containing only the next operator and conjunctions, and prove NP-completeness results for many fragments.
2020
Lower Bounds for Arithmetic Circuits via the Hankel Matrix with Nathanaël Fijalkow, Pierre Ohlmann, Olivier Serre · STACS paper

abstract

We study the complexity of representing polynomials by arithmetic circuits in both the commutative and the non-commutative settings. To analyse circuits we count their number of parse trees, which describe the non-associative computations realised by the circuit.
In the non-commutative setting a circuit computing a polynomial of degree d has at most 2^{O(d)} parse trees. Previous superpolynomial lower bounds were known for circuits with up to 2^{d{1/3-ε}} parse trees, for any ε > 0. Our main result is to reduce the gap by showing a superpolynomial lower bound for circuits with just a small defect in the exponent for the total number of parse trees, that is 2^{d{1 - ε}}, for any ε > 0.
In the commutative setting a circuit computing a polynomial of degree d has at most 2^{O(d log d)} parse trees. We show a superpolynomial lower bound for circuits with up to 2^{d{1/3 - ε}} parse trees, for any ε > 0. When d is polylogarithmic in n, we push this further to up to 2^{d{1 - ε}} parse trees.
While these two main results hold in the associative setting, our approach goes through a precise understanding of the more restricted setting where multiplication is not associative, meaning that we distinguish the polynomials (xy)z and x(yz). Our first and main conceptual result is a characterization result: we show that the size of the smallest circuit computing a given non-associative polynomial is exactly the rank of a matrix constructed from the polynomial and called the Hankel matrix. This result applies to the class of all circuits in both commutative and non-commutative settings, and can be seen as an extension of the seminal result of Nisan giving a similar characterization for non-commutative algebraic branching programs. Our key technical contribution is to provide generic lower bound theorems based on analyzing and decomposing the Hankel matrix, from which we derive the results mentioned above.
The study of the Hankel matrix also provides a unifying approach for proving lower bounds for polynomials in the (classical) associative setting. We demonstrate this by giving alternative proofs of recent lower bounds as corollaries of our generic lower bound results.
2020
On Efficient Low Distortion Ultrametric Embedding with Vincent Cohen-Addad, Karthik C. S. · ICML paper

abstract

A classic problem in unsupervised learning and data analysis is to find simpler and easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic linkage algorithms (single, average, or complete). However, these methods on a data set of \(n\) points in \(Ω(\log n)\) dimensions exhibit a quite prohibitive running time of \(Θ(n^2)\). In this paper, we provide a new algorithm which takes as input a set of points \(P\) in \(\mathbb{R}^d\), and for every \(c\ge 1\), runs in time \(n^{1+\fracρ{c^2}}\) (for some universal constant \(ρ>1\)) to output an ultrametric \(Δ\) such that for any two points \(u,v\) in \(P\), we have \(Δ(u,v)\) is within a multiplicative factor of \(5c\) to the distance between \(u\) and \(v\) in the “best” ultrametric representation of \(P\). Here, the best ultrametric is the ultrametric \(\tildeΔ\) that minimizes the maximum distance distortion with respect to the \(\ell_2\) distance, namely that minimizes \(\underset{u,v \in P}{\max}\ \frac{\tildeΔ(u,v)}{\|u-v\|_2}\). We complement the above result by showing that under popular complexity theoretic assumptions, for every constant \(\varepsilon>0\), no algorithm with running time \(n^{2-\varepsilon}\) can distinguish between inputs in \(\ell_\infty\)-metric that admit isometric embedding and those that incur a distortion of \(\frac{3}{2}\). Finally, we present empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
2020
Trade-Offs Between Size and Degree in Polynomial Calculus with Jakob Nordström, Dmitry Sokolov, Joseph Swernofsky · ITCS paper

abstract

Building on [Clegg et al. ’96], [Impagliazzo et al. ’99] established that if an unsatisfiable k-CNF formula over n variables has a refutation of size S in the polynomial calculus resolution proof system, then this formula also has a refutation of degree k + O(√(n log S)). The proof of this works by converting a small-size refutation into a small-degree one, but at the expense of increasing the proof size exponentially. This raises the question of whether it is possible to achieve both small size and small degree in the same refutation, or whether the exponential blow-up is inherent. Using and extending ideas from [Thapen ’16], who studied the analogous question for the resolution proof system, we prove that a strong size-degree trade-off is necessary.
2018
Lempel-Ziv: a “one-bit catastrophe” but not a tragedy with Sylvain Perifel · SODA paper

abstract

The so-called “one-bit catastrophe” for the compression algorithm LZ’78 asks whether the compression ratio of an infinite word can change when a single bit is added in front of it. We answer positively this open question raised by Lutz and others: we show that there exists an infinite word \(w\) such that \(ρ_{sup}(w)=0\) but \(ρ_{inf}(0w)>0\), where \(ρ_{sup}\) and \(ρ_{inf}\) are respectively the \(\limsup\) and the \(\liminf\) of the compression ratios \(ρ\) of the prefixes. To that purpose we explore the behaviour of LZ’78 on finite words and show the following results: - There is a constant \(C>0\) such that, for any finite word \(w\) and any letter \(a\), \(ρ(aw)\leq C\sqrt{ρ(w)\log|w|}\). Thus, sufficiently compressible words (\(ρ(w)=o(1/\log|w|)\)) remain compressible with a letter in front; - The previous result is tight up to a multiplicative constant for any compression ratio \(ρ(w)=O(1/\log|w|)\). In particular, there are infinitely many words \(w\) satisfying \(ρ(w)=O(1/\log|w|)\) but \(ρ(0w)=Ω(1)\).
2017
Lower Bounds and PIT for Non-Commutative Arithmetic Circuits with Restricted Parse Trees with Nutan Limaye, Srikanth Srinivasan · MFCS paper

abstract

We investigate the power of Non-commutative Arithmetic Circuits, which compute polynomials over the free non-commutative polynomial ring F<x_1,…,x_N>, where variables do not commute. We consider circuits that are restricted in the ways in which they can compute monomials: this can be seen as restricting the families of parse trees that appear in the circuit. Such restrictions capture essentially all non-commutative circuit models for which lower bounds are known. We prove several results about such circuits.
We show explicit exponential lower bounds for circuits with up to an exponential number of parse trees, strengthening the work of Lagarde, Malod, and Perifel (ECCC 2016), who prove such a result for Unique Parse Tree (UPT) circuits which have a single parse tree.We show explicit exponential lower bounds for circuits whose parse trees are rotations of a single tree. This simultaneously generalizes recent lower bounds of Limaye, Malod, and Srinivasan (Theory of Computing 2016) and the above lower bounds of Lagarde et al., which are known to be incomparable.We make progress on a question of Nisan (STOC 1991) regarding separating the power of Algebraic Branching Programs (ABPs) and Formulas in the non-commutative setting by showing a tight lower bound of n^{Omega(log d)} for any UPT formula computing the product of d n*n matrices.
When d <= log n, we can also prove superpolynomial lower bounds for formulas with up to 2^{o(d)} many parse trees (for computing the same polynomial). Improving this bound to allow for 2^{O(d)} trees would yield an unconditional separation between ABPs and Formulas.
- We give deterministic white-box PIT algorithms for UPT circuits over any field (strengthening a result of Lagarde et al. (2016)) and also for sums of a constant number of UPT circuits with different parse trees.

Journal Articles

2022
DeepSynth: Scaling Neural Program Synthesis with Distribution-based Search with Théo Matricon, Nathanaël Fijalkow, Kevin Ellis · J. Open Source Softw. paper
2021
Lower Bounds for Arithmetic Circuits via the Hankel Matrix with Nathanaël Fijalkow, Pierre Ohlmann, Olivier Serre · Comput. Complex. paper

abstract

We study the complexity of representing polynomials by arithmetic circuits in both the commutative and the non-commutative settings. To analyse circuits we count their number of parse trees, which describe the non-associative computations realised by the circuit.
In the non-commutative setting a circuit computing a polynomial of degree d has at most 2^{O(d)} parse trees. Previous superpolynomial lower bounds were known for circuits with up to 2^{d{1/3-ε}} parse trees, for any ε > 0. Our main result is to reduce the gap by showing a superpolynomial lower bound for circuits with just a small defect in the exponent for the total number of parse trees, that is 2^{d{1 - ε}}, for any ε > 0.
In the commutative setting a circuit computing a polynomial of degree d has at most 2^{O(d log d)} parse trees. We show a superpolynomial lower bound for circuits with up to 2^{d{1/3 - ε}} parse trees, for any ε > 0. When d is polylogarithmic in n, we push this further to up to 2^{d{1 - ε}} parse trees.
While these two main results hold in the associative setting, our approach goes through a precise understanding of the more restricted setting where multiplication is not associative, meaning that we distinguish the polynomials (xy)z and x(yz). Our first and main conceptual result is a characterization result: we show that the size of the smallest circuit computing a given non-associative polynomial is exactly the rank of a matrix constructed from the polynomial and called the Hankel matrix. This result applies to the class of all circuits in both commutative and non-commutative settings, and can be seen as an extension of the seminal result of Nisan giving a similar characterization for non-commutative algebraic branching programs. Our key technical contribution is to provide generic lower bound theorems based on analyzing and decomposing the Hankel matrix, from which we derive the results mentioned above.
The study of the Hankel matrix also provides a unifying approach for proving lower bounds for polynomials in the (classical) associative setting. We demonstrate this by giving alternative proofs of recent lower bounds as corollaries of our generic lower bound results.
2020
d-Galvin Families with Johan Håstad, Joseph Swernofsky · Electron. J. Comb. paper

abstract

The Galvin problem asks for the minimum size of a family \(\mathcal{F} \subseteq \binom{[n]}{n/2}\) with the property that, for any set \(A\) of size \(\frac n 2\), there is a set \(S \in \mathcal{F}\) which is balanced on \(A\), meaning that \(|S \cap A| = |S \cap \overline{A}|\). We consider a generalization of this question that comes from a possible approach in complexity theory. In the generalization the required property is, for any \(A\), to be able to find \(d\) sets from a family \(\mathcal{F} \subseteq \binom{[n]}{n/d}\) that form a partition of \([n]\) and such that each part is balanced on \(A\). We construct such families of size polynomial in the parameters \(n\) and \(d\).
2019
Lower Bounds and PIT for Non-commutative Arithmetic Circuits with Restricted Parse Trees with Nutan Limaye, Srikanth Srinivasan · Comput. Complex. paper

abstract

We investigate the power of Non-commutative Arithmetic Circuits, which compute polynomials over the free non-commutative polynomial ring F<x_1,…,x_N>, where variables do not commute. We consider circuits that are restricted in the ways in which they can compute monomials: this can be seen as restricting the families of parse trees that appear in the circuit. Such restrictions capture essentially all non-commutative circuit models for which lower bounds are known. We prove several results about such circuits.
We show explicit exponential lower bounds for circuits with up to an exponential number of parse trees, strengthening the work of Lagarde, Malod, and Perifel (ECCC 2016), who prove such a result for Unique Parse Tree (UPT) circuits which have a single parse tree.We show explicit exponential lower bounds for circuits whose parse trees are rotations of a single tree. This simultaneously generalizes recent lower bounds of Limaye, Malod, and Srinivasan (Theory of Computing 2016) and the above lower bounds of Lagarde et al., which are known to be incomparable.We make progress on a question of Nisan (STOC 1991) regarding separating the power of Algebraic Branching Programs (ABPs) and Formulas in the non-commutative setting by showing a tight lower bound of n^{Omega(log d)} for any UPT formula computing the product of d n*n matrices.
When d <= log n, we can also prove superpolynomial lower bounds for formulas with up to 2^{o(d)} many parse trees (for computing the same polynomial). Improving this bound to allow for 2^{O(d)} trees would yield an unconditional separation between ABPs and Formulas.
- We give deterministic white-box PIT algorithms for UPT circuits over any field (strengthening a result of Lagarde et al. (2016)) and also for sums of a constant number of UPT circuits with different parse trees.
2019
Non-commutative computations: lower bounds and polynomial identity testing with Guillaume Malod, Sylvain Perifel · Chic. J. Theor. Comput. Sci. paper

abstract

In the setting of non-commutative arithmetic computations, we define a class of circuits that generalize algebraic branching programs (ABP). This model is called unambiguous because it captures the polynomials in which all monomials are computed in a similar way (that is, all the parse trees are isomorphic). We show that unambiguous circuits of polynomial size can compute polynomials that require ABPs of exponential size, and that they are incomparable with skew circuits. Generalizing a result of Nisan on ABPs, we provide an exact characterization of the complexity of any polynomial in our model, and use it to prove exponential lower bounds for explicit polynomials such as the determinant. Finally, we give a white-box deterministic polynomial-time algorithm for polynomial identity testing (PIT) on unambiguous circuits over \(\mathbb{R}\) and \(\mathbb{C}\).
2017
De Bruijn-Erdős-type theorems for graphs and posets with Pierre Aboulker, David Malec, Abhishek Methuku, Casey Tompkins · Discret. Math. paper

abstract

A classical theorem of De Bruijn and Erdős asserts that any noncollinear set of n points in the plane determines at least n distinct lines. We prove that an analogue of this theorem holds for graphs. Restricting our attention to comparability graphs, we obtain a version of the De Bruijn-Erdős theorem for partially ordered sets (posets). Moreover, in this case, we have an improved bound on the number of lines depending on the height of the poset. The extremal configurations are also determined.

Preprints

2026
Analyzing and Leveraging the k-Sensitivity of LZ77 with Gabriel Bathie, Paul Huber, Akka Zemmari arXiv

abstract

We study the sensitivity of the Lempel-Ziv 77 compression algorithm to edits, showing how modifying a string \(w\) can deteriorate or improve its compression. Our first result is a tight upper bound for \(k\) edits: \(\forall w' \in B(w,k)\), we have \(C_{\mathrm{LZ77}}(w') \leq 3 \cdot C_{\mathrm{LZ77}}(w) + 4k\). This result contrasts with Lempel-Ziv 78, where a single edit can significantly deteriorate compressibility, a phenomenon known as a one-bit catastrophe. We further refine this bound, focusing on the coefficient \(3\) in front of \(C_{\mathrm{LZ77}}(w)\), and establish a surprising trichotomy based on the compressibility of \(w\). More precisely we prove the following bounds: - if \(C_{\mathrm{LZ77}}(w) \lesssim k^{3/2}\sqrt{n}\), the compression may increase by up to a factor of \(\approx 3\), - if \(k^{3/2}\sqrt{n} \lesssim C_{\mathrm{LZ77}}(w) \lesssim k^{1/3}n^{2/3}\), this factor is at most \(\approx 2\), - if \(C_{\mathrm{LZ77}}(w) \gtrsim k^{1/3}n^{2/3}\), the factor is at most \(\approx 1\). Finally, we present an \(\varepsilon\)-approximation algorithm to pre-edit a word \(w\) with a budget of \(k\) modifications to improve its compression. In favorable scenarios, this approach yields a total compressed size reduction by up to a factor of~\(3\), accounting for both the LZ77 compression of the modified word and the cost of storing the edits, \(C_{\mathrm{LZ77}}(w') + k \log |w|\).
2025
ParamExplorer: A framework for exploring parameters in generative art with Julien Gachadoat arXiv code

abstract

Generative art systems often involve high-dimensional and complex parameter spaces in which aesthetically compelling outputs occupy only small, fragmented regions. Because of this combinatorial explosion, artists typically rely on extensive manual trial-and-error, leaving many potentially interesting configurations undiscovered. In this work we make two contributions. First, we introduce ParamExplorer, an interactive and modular framework inspired by reinforcement learning that helps the exploration of parameter spaces in generative art algorithms, guided by human-in-the-loop or even automated feedback. The framework also integrates seamlessly with existing p5js projects. Second, within this framework we implement and evaluate several exploration strategies, referred to as agents.
2023
Learning temporal formulas from examples is hard with Corto Mascle, Nathanaël Fijalkow arXiv

abstract

We study the problem of learning linear temporal logic (LTL) formulas from examples, as a first step towards expressing a property separating positive and negative instances in a way that is comprehensible for humans. In this paper we initiate the study of the computational complexity of the problem. Our main results are hardness results: we show that the LTL learning problem is NP-complete, both for the full logic and for almost all of its fragments. This motivates the search for efficient heuristics, and highlights the complexity of expressing separating properties in concise natural language.
2018
Tight Bounds using Hankel Matrix for Arithmetic Circuits with Unique Parse Trees with Nathanaël Fijalkow, Pierre Ohlmann arXiv

abstract

This paper studies lower bounds for arithmetic circuits computing (non-commutative) polynomials. Our conceptual contribution is an exact correspondence between circuits and weighted automata: algebraic branching programs are captured by weighted automata over words, and circuits with unique parse trees by weighted automata over trees. The key notion for understanding the minimisation question of weighted automata is the Hankel matrix: the rank of the Hankel matrix of a word or tree series is exactly the size of the smallest weighted automaton recognising this series. For automata over words, the correspondence we establish allows us to rephrase Nisan’s celebrated tight bounds for algebraic branching programs. We extend this result by considering automata over trees and obtain the first tight bounds for all circuits with unique parse trees.

Thesis

2018
Contributions to Arithmetic Complexity and Compression PhD thesis, Université Paris Diderot pdf

abstract

This thesis explores two territories of computer science: complexity and compression. More precisely, in a first part, we investigate the power of non-commutative arithmetic circuits, which compute multivariate non-commutative polynomials. For that, we introduce various models of computation that are restricted in the way they are allowed to compute monomials. These models generalize previous ones that have been widely studied, such as algebraic branching programs. The results are of three different types. First, we give strong lower bounds on the number of arithmetic operations needed to compute some polynomials such as the determinant or the permanent. Second, we design some deterministic polynomial-time algorithm to solve the white-box polynomial identity testing problem. Third, we exhibit a link between automata theory and non-commutative arithmetic circuits that allows us to derive some old and new tight lower bounds for some classes of non-commutative circuits, using a measure based on the rank of a so-called Hankel matrix. A second part is concerned with the analysis of the data compression algorithm called Lempel-Ziv. Although this algorithm is widely used in practice, we know little about its stability. Our main result is to show that an infinite word compressible by LZ’78 can become incompressible by adding a single bit in front of it, thus closing a question proposed by Jack Lutz in the late 90s under the name “one-bit catastrophe”. We also give tight bounds on the maximal possible variation between the compression ratio of a finite word and its perturbation—when one bit is added in front of it.

A few slides

A few posters

Popular Science

Henri Potier à l’école de la complexité Le blog binaire, Le Monde · avec Sylvain Perifel Chapitre 1 Chapitre 2
Un seul bit vous manque, et ça ne compresse plus… INS2I CNRS
La théorie de la complexité algorithmique Interstices