Welcome. My first post is inspired by the computer scientist and blogger Scott Aaronson, who recently blogged his criticisms of a theory I've been working on, called causal emergence. To see the simple nature of his error, skip down to "Isn't causal emergence just an issue of normalization?", although this does assume you are familiar with some of the theory's terminology. Since Scott's criticisms reflected a lack of understanding of the theory, they prompted me to write this general explainer. Please note this explainer is purposefully designed not to be technical, formalized, or comprehensive. Its goal is to give interested parties a conceptual grasp of the theory, using relatively basic notions of causation and information.

What's causal emergence?

It's when the higher scale of a system has more information associated with its causal structure than the underlying lower scale. Causal structure just refers to a set of causal relationships between some variables, such as states or mechanisms. Measuring causal emergence is like looking at the causal structure of a system through a camera (the theory): as you focus the camera (look at different scales), the causal structure snaps into focus. Notably, it doesn't have to be "in focus" at the lowest possible scale, the microscale. Why is this? In something approaching plain English: macrostates can be strongly coupled even while their underlying microstates are only weakly coupled. The goal of the theory is to search across scales until the scale at which variables (like elements or states) are most strongly causally coupled pops out.

Isn't this against science, reductionism, or physicalism?

Nope. The theory adheres to something called supervenience. That's a technical term that means: if everything is fixed about the lower scale of a system, everything about the higher scale must follow. So there's nothing spooky, supernatural, or anti-physicalist about the results. Rather, the theory provides a toolkit to identify appropriate instances of causal emergence or reduction depending on the properties of the system under consideration. It just means that, when thinking about causation, reductionism isn't always best. The higher scales the theory considers are things like coarse-grains (groupings of states or mechanisms) or leaving states or elements out of the system, among others. These aren't supernatural, just different levels of description, some of which capture the real causal structure (the coupling between variables) better. In this sense, the theory says that causal interpretations are not relative or arbitrary but instead are constrained by constraint.

How do you analyze causal structure?

Causation has long been considered a philosophical subject, even though it's at the heart of science in the form of experiments and separating correlation from causation. Yet causation, much like information, can actually be formalized abstractly and mathematically. For instance, in the 90s and 2000s, a researcher named Judea Pearl introduced something called the do(x) operator. The idea is to formalize causation by modeling the interventions an experimenter makes on a system. Let's say I want to check if there is a causal relationship between a light switch and a light bulb in a room. Formally, I would do(light switch = up) at some time t, and observe the effects on the bulb at some time t+1. One of the fundamental ways of analyzing causation is what's called an A/B test, or a randomized trial.
For two variables A and B, you randomize those variables and observe the outcome of your experiment. Think of it like injecting noise into the experiment, which then tells you which of those two variables is more effective at producing the outcome. For example, let's say the light bulb flickers into the {off} state while the light switch is in the {up} state 20% of the time. If you do(light switch = up) and do(light switch = down) in a 50/50 manner, it reveals the effects of the states. From this, you can construct something (using Bayes' theorem) called a transition table:

do(switch = up): bulb {on} 80% of the time, {off} 20% of the time
do(switch = down): bulb {off} 100% of the time

Note that this reflects the actual causal structure. Flipping the switch {up} really does cause the bulb to turn {on} 80% of the time. Doing the A/B test appropriately screened out everything else, such as how often you happened to flip the switch, leaving only the conditional probabilities between the states. And, ultimately, causal relationships are conditional. They aren't about the probabilities of the states themselves, but about "if x then y" classes of statements. Of course, we can also do A/B/C tests, and so on. What matters is randomizing over everything (creating an independent noise source) so that the result exposes the conditional probabilities between the states. The theory of causal emergence formalizes this as applying an intervention distribution: a probability distribution of do(x) operators. The intervention distribution that corresponds to an A/B test would be [1/2 1/2], and for an A/B/C test, [1/3 1/3 1/3]. This is called a maximum entropy, or uniform, distribution.

How does information theory relate to causal structure?

Consider two variables, X and Y. We want to assess the causal influence X has over Y, X → Y. Assume, for now, there are no other effects on Y. If every change in the state of X is followed by a change in the state of Y, then the state of X contains a lot of causal information about Y. So if Y is very sensitive to the state of X, a metric of causal influence should be high. Note this is different from predictive information. You might be able to predict Y given X even if changes in X don't lead to changes in Y (like how sales of swimwear in June could predict sales of air conditioners in July). To be more formal about assessing X → Y, we inject noise into X and observe the effects on Y. Effective information, or EI, is the mutual information I(X;Y) between X and Y while intervening to set X to maximum entropy (injecting noise). This is the same as applying a uniform intervention distribution over the states of X. It is a measure of what can be called influence, or coupling, or constraint.
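To make the definition concrete, here is a minimal sketch in Python (my own illustration, not code from the papers): represent the X → Y relationship as a transition table whose rows give the effects of each do(x), apply the uniform (maximum-entropy) intervention distribution over those rows, and compute the mutual information between interventions and effects.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with the 0 * log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def effective_information(tpm):
    """EI of a transition table: the mutual information I(X;Y) when the
    intervention distribution over the rows (states of X) is uniform.

    tpm[i, j] = probability that intervening to set X to state i
    produces state j of Y at the next time step."""
    tpm = np.asarray(tpm, dtype=float)
    effect_dist = tpm.mean(axis=0)  # distribution over Y under uniform do(x)
    return entropy(effect_dist) - np.mean([entropy(row) for row in tpm])
```

The same calculation applies whether the rows are the states of a single variable or the full state space of a system, which is how it gets used below.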
Beyond capturing in an intuitive way how Y is causally coupled to the state of X, here are a few additional reasons the metric is appropriate: i) setting X to maximum entropy screens off everything but the conditional probabilities; ii) it's like doing an experimental randomized trial, or A/B test, without prior knowledge of the effects of X's states on Y; iii) it doesn't leave anything out, so if a great many states of X don't impact Y, this will be reflected in the causal influence of X on Y; iv) it's like injecting the maximum amount of experimental information into X, Hmax(X), in order to see how much of that information is reflected in Y; v) the metric can be derived from the cause/effect information of each state (y → X or x → Y), such as the expected number of yes/no questions it takes to identify the intervention on X at t given some y at t+1; vi) it isolates the information solely contained in the transition table (the actual causal structure); and vii) the metric is provably grounded in traditional notions of causal influence.

Ultimately, the metric is using information theory to track the counterfactual dependence of Y on X. In traditional causal terminology, this is putting a bit value on notions like how necessary and sufficient the state of X is for the state of Y. The theory generalizes these properties as determinism (the lack of noise) and degeneracy (the amount of convergence) over the state transitions, and proves that EI actually decomposes into these properties. EI is low if states of X only weakly determine the states of Y, or if many states of X determine the same states of Y (as those states are unnecessary from the perspective of Y). It is maximal only if all the causal relationships in X → Y are composed of biconditional logical relationships (x iff y). Another way to think about it is that EI captures how much difference each possible do(x) makes. In causal relationships where all states transition to the same state, no state makes a difference, so the EI is zero. If all interventions lead to completely random effects, the measure is also zero. The measure is maximal (equal to the logarithm of the number of states) if each intervention has a unique effect (i.e., interventions on X make the maximal difference to Y).

How is the metric applied to systems?

Consider a simple switch/bulb system where the light doesn't flicker, the relationship of which can be represented by the transition table:

do(switch = up): bulb {on} 100% of the time
do(switch = down): bulb {off} 100% of the time

In this system the causal structure is totally deterministic (there's no noise). It's also non-degenerate (all states transition to unique states). So the switch being {up} is both sufficient and necessary for the bulb being {on}. Correspondingly, the EI is 1 bit. However, for the previous case where the light flickered, the EI would be lower, at 0.61 bits. Effective information even captures things that traditional, non-information-theoretic measures of causation don't. For instance, let's say that we instead analyze a system of a light dial with 256 states and a light bulb with 256 states of luminance. Both the determinism (1) and the degeneracy (0) are identical to the original binary switch/bulb system. But the causal structure overall contains a lot more information: each state leads to exactly one other state, and there are hundreds of states. EI in this system is 8 bits, instead of 1 bit. One can use the metric to look at individual causal relationships, but generally in the theory it's used to assess the causal structure of a system as a whole.
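As a quick check on the numbers above, here is a sketch reusing the effective_information helper from the earlier snippet. The transition tables are my reconstructions of the examples in the text, and the determinism/degeneracy coefficients are one way of writing the decomposition described above.

```python
import numpy as np

# Deterministic switch/bulb: do(up) -> {on}, do(down) -> {off}
switch_bulb = np.array([[1.0, 0.0],
                        [0.0, 1.0]])

# Flickering bulb: do(up) -> {on} 80% / {off} 20%, do(down) -> {off}
flicker = np.array([[0.8, 0.2],
                    [0.0, 1.0]])

# 256-state dial -> 256 luminance states, each dial state has a unique effect
dial = np.eye(256)

print(effective_information(switch_bulb))         # 1.0 bit
print(round(effective_information(flicker), 2))   # 0.61 bits
print(effective_information(dial))                # 8.0 bits

def determinism(tpm):
    """1 minus the average noisiness of the rows (1 = fully deterministic)."""
    n = len(tpm)
    return 1 - np.mean([entropy(row) for row in tpm]) / np.log2(n)

def degeneracy(tpm):
    """How much the rows converge onto the same effects (0 = no overlap)."""
    n = len(tpm)
    return 1 - entropy(np.asarray(tpm).mean(axis=0)) / np.log2(n)

# EI decomposes as (determinism - degeneracy) * log2(number of states):
print((determinism(flicker) - degeneracy(flicker)) * np.log2(2))  # ~0.61 again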
Often, the system in question is something like a Markov process or a set of logic gates, which can be drawn as a set of states and their transitions (left) or as mechanisms with connections (right).

[figure: a Markov process shown as states and transitions (left) and as mechanisms with connections (right), along with its TPM]

Instead of thinking about causal structure in terms of just X → Y, we can ask about the causal structure of the system as a whole, S → S. This is like thinking of the entire system as a channel that is transforming the past into the future. We can construct a transition probability matrix (TPM), a transition table over the entire set of system states, which reflects this. For example, for the Markov process above, with the states {00, 01, 10, 11}, the TPM (which is the causal structure of the system as a whole) is shown on the right.

Couldn't you use some other measure or numerical value?

The goal of EI is to capture how much information is associated with the causal structure of the system. But doing so doesn't prove it's the one and only true measure. There could be a family of similar metrics, although my guess is that most reduce to EI, or close variants. Regardless, given its relationship to the mutual information as well as to important causal properties, this isn't some arbitrary metric that's swappable with something entirely different. For instance, EI fits well with other fundamental concepts of information theory, like the channel capacity. The channel capacity is the upper bound on the information that can be reliably sent over a channel. In the context of the theory, the channel capacity is how much information someone could possibly send over the X → Y relationship by varying the state of X according to any intervention distribution. The channel capacity does end up having an important connection to causal structure. However, it's not the same as a direct metric of X's causal influence on Y. For instance, knowing the channel capacity doesn't tell you about the determinism and degeneracy of the causal relationships between the states, nor does it tell you whether interventions on X will produce reliable effects on Y, nor how sensitive the states of Y are to the states of X. With that said, one of the interesting conclusions of the research is that, by looking at higher scales, EI can approach or even equal the channel capacity.

What does any of this have to do with emergence?

It's about the emergence of higher-scale causal structure. To see if this is happening in a system, we do causal analysis across scales and measure the effective information at those different scales. What counts as a macroscale? Broadly, any description of a system that's not the most detailed microscale. Leaving some states exogenous, coarse-grains (grouping states or elements), black boxes (making states or elements exogenous when they are downstream of interventions), setting some initial state or boundary condition: all of these are macroscales in the broad sense. Moving from the microscale to a macroscale might look something like this:

[figure: grouping microstates into macrostates]

Interestingly, macroscales can have higher EI than the microscale. Basically, in some systems, doing a full series of A/B tests at the macroscale gives you more information than doing a corresponding full series of A/B tests at the microscale. More generally, you can think about it as how informative a TPM of the system is, and how that TPM gets more informative at higher scales.
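Here is a toy illustration of that, in the spirit of the examples from the papers (this particular TPM and grouping are mine, chosen for simplicity; it again reuses effective_information from above): a noisy, degenerate four-state microscale whose coarse-graining into two macrostates has higher EI.

```python
import numpy as np

# Microscale: four states {00, 01, 10, 11}. The first three transition noisily
# and interchangeably among themselves (a noisy, degenerate group); 11 -> 11.
micro_tpm = np.array([
    [1/3, 1/3, 1/3, 0.0],
    [1/3, 1/3, 1/3, 0.0],
    [1/3, 1/3, 1/3, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Macroscale: coarse-grain {00, 01, 10} into one macrostate "OFF" and keep
# {11} as "ON". From any microstate in OFF the system stays inside OFF, so
# the macro transitions are deterministic and non-degenerate.
macro_tpm = np.array([
    [1.0, 0.0],   # OFF -> OFF
    [0.0, 1.0],   # ON  -> ON
])

print(round(effective_information(micro_tpm), 2))  # ~0.81 bits
print(effective_information(macro_tpm))            # 1.0 bit: causal emergence
```

In this toy case the macroscale's 1 bit also happens to equal the capacity of the microscale channel (the most any intervention distribution could transmit through that TPM), which is the kind of convergence between EI and the channel capacity mentioned above.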
Wait. How is that even possible?

There are multiple answers. In a general sense, causal structure is scale-variant. Microscale mechanisms (like NOR gates in a computer) can form a different macroscale mechanism (like a COPY gate). This is because the conditional probabilities of state transitions change across scales. Consequently, the determinism can increase and the degeneracy can decrease at the higher scale (the causal relationships can be stronger). Another answer comes from information theory. Higher-scale relationships can have more information because they are performing error correction. As Shannon demonstrated, you can increase how much information is transmitted across a channel by changing the input distribution. The analogy is that intervening on a system at different scales is like trying different inputs to a channel. From the perspective of the microscale, some higher-scale distributions will transmit more information. This is because doing a series of A/B tests to capture the effects of the macroscale states doesn't correspond to doing a series of A/B tests to capture the effects of the microscale states. A randomized trial at the macroscale of medical treatments to see their effect on tumors won't correspond to an underlying set of microscale randomized trials, because many different microstates make up the macrostates.

Isn't causal emergence just an issue of normalization, as Scott claimed in his blog post?

Not at all. To assess causal emergence there must be a comparison between two cases: the fully detailed and most fine-scaled description of a system (the territory) and some reduced description (a map). To clearly see where Scott is wrong, consider the simplest example of causal emergence: the information gained by leaving a single state out of the system. Here, one case is when the state is indeed included in the system (the territory). In the formal language of the theory, this means we intervene on that particular state and observe the results, along with all the other states. This is done to measure the information produced by a distribution of interventions. Let's put aside why we want that information for now and just focus on the distribution, which might look something like [p(do(s1))=1/3, p(do(s2))=1/3, p(do(s3))=1/3]. In the second case we leave s3 out of the system, which definitionally means not intervening on or observing it, so p(do(s3))=0 instead of 1/3. Causal emergence would be when the information generated by the intervention distribution increases in this second case, where p(do(s3))=0. Scott Aaronson claimed that because the intervention distribution changes between the two cases, causal emergence is some kind of "normalization trick." According to Scott, making the comparison "fairly" in this simple example would require dropping s3 from both the map and the territory; that is, p(do(s3)) would have to be 0 in both cases, even in the case where s3 is included in the description. Note that Scott has no mathematical reason for mandating this, and the only math he did in his post was pointing out that if you use the same intervention distribution in both cases, you get the same bit value. This is a mere tautology, as you are just doing the same thing in both cases.
Since the whole point of the research was to measure the change in information between maps and territories, insisting that the intervention distributions must be the same for both flat-out misses the point of the research and, at the same time, leads to nonsensical comparisons. This holds true for all manner of macroscales, is obvious in example systems (including the one Scott used in his post), and doesn't have anything to do with using the maximum entropy distribution to measure effective information. So even before any mathematics comes into play, the criticism is merely a failure to understand the ways in which the intervention distribution can and should change depending on how the system is modeled or described. For instance, two variables grouped together into a macro variable by an experimenter will not be intervened on in the same way as before; leaving out a state implies not intervening on or observing that state; and so on. Scott never responded to this most basic of points. He also appeared not to understand the central thrust of the paper: that this change in the intervention distribution is analogous to how changes in an information channel's input distribution can lead to more information being transmitted over that channel. It would be an equivalent mistake to call Shannon's description of how to increase the information sent over a channel "an issue of normalization" merely because the input distribution into a channel changes depending on the encoding of some signal or the choices the user makes.

Why does causal emergence matter?

The theory does imply that universal reductionism is false when it comes to thinking about causation, and that sometimes higher scales really do have more causal influence (and associated information) than whatever underlies them. This is common sense in our day-to-day lives, but in the intellectual world it's very controversial. More importantly, the theory provides a toolkit for judging cases of emergence or reduction with regard to causation. It also provides some insight into the structure of science itself, and why it's hierarchical (biology above chemistry, chemistry above physics). One reason the theory offers is that scientists naturally gravitate to where the information about causal structure is greatest, which is where their experiments are most rewarded in terms of information, and this won't always be the ultimate microscale. There are also specific applications of the theory, some of which are already underway. These are things like figuring out what scale of a biological or nervous system is most informative to do experiments on, or the causal influence of one part of a system over another part, or whether macrostates or microstates matter more to the behavior of a system. As time goes on, I'm sure I'll criticize various ideas on this blog. But I'll make sure that I always keep an open mind when I do and never rush to judgment.
30 Comments
julia buntaine
6/8/2017 06:42:32 am
great post!
George Musser
6/8/2017 10:32:37 am
Alas, I took away the opposite lesson: do rush, because people pay a lot more attention to that than to prudence. Your ideas have gotten a lot of attention through Scott's rush, and as frustrating as some of the responses have been, that's ultimately good for your ideas.
Will
6/8/2017 12:07:59 pm
Several things said in the blog post seem to me to be wrong, even for an informal introduction. Especially because you are quite critical of Scott in this post, I thought it was worth pointing out. I hope you don't find this rude.
Thanks so much for your comment, Will. I appreciate your input. Scott and I have had some great interactions and I certainly don't mean to sound overly critical (I actually edited the post to reflect this, just in case). I just think Scott's criticism misses that we are trying to assess causation.
Will
6/8/2017 05:30:36 pm
The edited version seems much better.
Erik Hoel
6/9/2017 06:22:38 am
To your comments: perhaps check out how the EI reflects the state-transition profiles in terms of the probability distribution of future/possible states in Hoel et al. (2013). The error correction comes from the fact that if you're trying to "send" a single state that is part of a family of multiply realizable states, and if due to noise another state is received, then it's as if that error is being corrected via the redundancy of multiple realizability. And keep in mind that the EI will be the same as the channel capacity in many cases. In the papers I propose that the channel capacity is like the fixed causal structure, and assessing the effects of individual states at different scales (in a way that accords with identifying their sufficiency, necessity, etc., all calculated using their state-transition profiles) is like using different inputs. I think your hypothesis as to why scientists work at higher scales is the standard reasonable one, and, as I keep saying, it's not incompatible with what's being suggested in the theory. I just don't think it's the end of the story. Thanks for stopping by - Erik
Larry Wasserman
6/8/2017 04:36:03 pm
Erik:
Erik Hoel
6/8/2017 05:10:35 pm
Hey Larry - I appreciate you taking the time to read through. I wouldn't claim that all experiments must, no matter what, be over a uniform distribution. You can do statistics over any distribution. I'm pointing out that the reason behind EI having the distribution it does (the maximum entropy) is so that it captures the information solely in the conditional probabilities by screening off the frequency of interventions. So it looks to capture causal structure via enforcing an independent noise source, which isn't the same as applying an arbitrary distribution.
6/9/2017 01:51:00 pm
I was the one who prodded Scott into commenting on your paper.
Erik Hoel
6/9/2017 05:13:45 pm
So you're the one who started this James! haha
Randall
6/9/2017 02:05:20 pm
Erik, looking forward to seeing more of your work in the future. Rather than taking a simple example like the lightbulb, I think it would be more fruitful to take a cellular automaton that is Turing-complete. Being Turing-complete, it could do some very interesting computation. It seems inevitable (to me!) that making sense of the results would require a macroscale view.
Erik Hoel
6/9/2017 05:23:05 pm
Thanks Randall, I appreciate that. The idea of demonstrating causal emergence with cellular automata (instead of the Markov processes or networks of logic gates shown in the papers) is definitely a good suggestion. You are right that it would provide a way to directly investigate its relationship to things like Turing completeness. Macroscales definitely perform different computations than microscales (this is pretty clear in the papers), but we haven't looked into the relationship to Turing completeness.
Larry
6/10/2017 07:58:22 am
Is Wolchover's headline
Erik Hoel
6/10/2017 09:11:49 am
Hey Larry - thanks for your thoughts. To your first question: the article also covered IIT, so it’s hard to precisely separate all the claims and I certainly wouldn't want to start judging Natalie's work (she knows more about science communication than I do). More to the point, as I have written articles for sites before, a shocking thing is that often a headline writer (not the writer themselves) chooses the headline, based on SEO. But I wouldn’t say Natalie misrepresented the theory; rather, she focused on the strongest claims of the theory.
Natesh Ganesh
6/11/2017 08:57:40 pm
Hi Erik,
Erik Hoel
6/12/2017 07:53:41 am
Thanks for reading, Natesh. The example is of a causal relationship (which can be represented as a TPM), but you're right that the example system isn't a completely described Markov process in the sense you're talking about, because the rows and columns are over different states. One could make the same TPMs with simple Markov processes of two states, A & B, in which case each entry (i,j) of [A->A, A->B; B->A, B->B] is the probability of that state transition. These are the types of processes used in the papers, and we want to know how much information is gained by intervening upon them. What EI then captures is how different the state-transition distributions (one per state) are from one another, by comparing them all. This is also how much information some set of randomized interventions produces, because it is only a high bit value when all the states have different and high-probability effects, meaning that for each intervention, such as do(A) and do(B), you get an informative effect.
Natesh Ganesh
6/13/2017 09:46:13 am
Thanks, that cleared up a few things. So are all transition matrices in your papers structured the same way? That is, are they not the traditional Markov PTM, but rather matrices whose rows and columns relate to different systems? I ask since I have seen your coarse-graining examples and wonder whether, when you go from a 4x4 to a 2x2 matrix, you are dealing with the traditional Markov PTM in those cases.
Erik Hoel
6/13/2017 12:31:41 pm
Yup, the ones in the paper are traditional PTMs/TPMs with rows and columns (i,j) that dictate the probability of a state transition within a particular system, not two different systems.
Natesh Ganesh
6/19/2017 11:43:19 am
Sorry for the gap in response, but paper deadlines keep one busy.
Erik Hoel
6/20/2017 10:00:06 am
Hey Natesh - I've been adding some more to the primer, specifically concerning information theory, for those interested. It should now give you a clearer idea of what's going on (the papers also address these questions).
P. Applebee
6/18/2017 12:40:27 pm
Thanks for writing this, Erik. I think some of the issues stem from what appears to be "mixed messaging" from you and the people who have written about you. For example, everything in your first two sections here sounds totally reasonable and very interesting, yet on your Twitter page there is a tweet admonishing the idea that "my atoms made me do it". Yet here on your blog, it seems that you actually *aren't* making that argument: you're saying something much weaker, just that talking about atoms isn't the right level of description to explain "why you did something". In fact, wouldn't you admit that if physical determinism is true, your behavior *is* entirely determined by the behavior of your atoms?
Erik Hoel
6/18/2017 02:41:35 pm
Thanks for reading. Anything written for a popular audience about any subject won't be complete. In general, the theory is sketched pretty clearly in the papers, the primer, and essays. Given the theory, I think there's a good chance that higher scales really do cause behavior, in which case, yes, it would be incorrect to claim that "my atoms made me do it." To your question: causal emergence can occur in deterministic systems (and many people say we're not living in a deterministic universe anyway).
6/21/2017 04:45:45 am
I have not read all your work, nor all the comments, but "causal emergence" caught my attention and seems to me right on target, not least because it overlaps with (my modest understanding of) the "process philosophy" of Alfred North Whitehead (and, tangentially, a core notion in Buddhism). Which is to say: "causal emergence" plays a vital role in "creative evolution" as presented in Whitehead's (scientifically informed) metaphysics (and various Buddhist sutras). I'm not the person to elaborate on these two suggestions. I simply present them to you as provocative edifications -- potentially useful to you in your expanding research and interpretation, which might then also include Henri Bergson.
Erik Hoel
6/21/2017 01:47:44 pm
Thanks, Stefan. I'll confess I'm not familiar with process philosophy, beyond being aware of its existence and associating it with Whitehead. I haven't read any Bergson either, although I know of his infamous debate with Einstein. Both are interesting suggestions. One thing is that I am trying to avoid too much metaphysics - people get up in arms about that sort of thing very easily (and I've noticed it's more when other people describe my work than when they read my work). I think causation will be like information: something that started out seeming very philosophical in its questions and debates but then eventually got (mostly) skimmed away from philosophy into science and math. That's kind of what I'm hoping to contribute to here, but with causation and issues of reduction/emergence.
T. Anton
8/7/2017 04:03:54 am
Nice work! However, there are two avenues of attack that immediately occur to me:
Erik Hoel
8/7/2017 09:06:20 am
Hey T. Anton - thanks! Great questions.
This is a copy (because I am lazy :) of the comment I just left on Scott's blog entry.
3/5/2019 12:32:41 pm
I realize that this is an old post, but I'm curious about one thing: