ERIK HOEL
craniotomies

A primer on causal emergence

6/8/2017


 
Welcome.
 
My first post is inspired by the theoretical computer scientist and blogger Scott Aaronson, who recently blogged his criticisms of a theory I’ve been working on, called causal emergence. To see the simple nature of his error, skip down to Isn’t causal emergence just an issue of normalization?, although that section does assume you are familiar with some of the theory's terminology. Since Scott's criticisms reflected a misunderstanding of the theory, they prompted me to write this general explainer. Please note that this explainer is purposefully designed not to be technical, formalized, or comprehensive. Its goal is to give interested parties a conceptual grasp of the theory, using relatively basic notions of causation and information.
 
What’s causal emergence?
It’s when the higher scale of a system has more information associated with its causal structure than the underlying lower scale. Causal structure just refers to a set of causal relationships between some variables, such as states or mechanisms. Measuring causal emergence is like looking at the causal structure of a system through a camera (the theory): as you adjust the focus (examine different scales), the causal structure snaps into focus. Notably, it doesn’t have to be “in focus” at the lowest possible scale, the microscale. Why is this? In something approaching plain English: macrostates can be strongly coupled even while their underlying microstates are only weakly coupled. The goal of the theory is to search across scales until the scale at which variables (like elements or states) are most strongly causally coupled pops out.

Isn’t this against science, reductionism, or physicalism?
Nope. The theory adheres to something called supervenience. That’s a technical term that means: if everything is fixed about the lower scale of a system, everything about the higher scale must follow. So there’s nothing spooky, supernatural, or anti-physicalist about the results. Rather, the theory provides a toolkit to identify appropriate instances of causal emergence or reduction depending on the properties of the system under consideration. It just means that, when thinking about causation, reductionism isn’t always best. The higher scales the theory considers are things like coarse-grains (groupings of states or mechanisms) or leaving states or elements out of the system, among others. These aren’t supernatural, just different levels of description, some of which capture the real causal structure (the coupling between variables) better. In this sense, the theory says that causal interpretations are not relative or arbitrary but instead are constrained by constraint.
 
How do you analyze causal structure?
Causation has long been considered a philosophical subject, even though it’s at the heart of science in the form of experiments and separating correlation from causation. Causation, much like information, can actually be formalized abstractly and mathematically. For instance, in the 90s and 2000s, a researcher named Judea Pearl introduced something called the do(x) operator. The idea is to formalize causation by modeling the interventions an experimenter makes on a system. Let’s say I want to check if there is a causal relationship between a light switch and a light bulb in a room. Formally, I would do(light switch = up) at some time t, and observe the effects on the bulb at some time t+1.
 
One of the fundamental ways of analyzing causation is what’s called an A/B test, or a randomized trial. For two variables A and B, you randomize those variables and observe the outcome of your experiment. Think of it like injecting noise into the experiment, which then tells you which of those two variables is more effective at producing the outcome. For example, let’s say the light bulb flickers into the {off} state while the light switch is in the {up} state 20% of the time. If you do(light switch = up) and then do(light switch = down) in a 50/50 manner, it reveals the effects of the states. From this, you can construct something (using Bayes’ theorem) called a transition table:
[Transition table for the flickering switch/bulb system: do(switch = up) → bulb {on} with p = 0.8, {off} with p = 0.2; do(switch = down) → bulb {off} with p = 1.]
Note that this reflects the actual causal structure. Flipping the switch {up} really does cause the bulb to turn {on} 80% of the time. Doing the A/B test appropriately screened out everything except the conditional probabilities between the states (for instance, it screened out how often you happened to flip the switch one way or the other). And, ultimately, causal relationships are conditional. They aren’t about the probabilities of the states themselves, but about “if x then y” classes of statements.
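To make the procedure concrete, here is a minimal Python sketch of my own (the function names are just for illustration; the 80/20 flicker probabilities come from the example above, and the assumption that {down} always leaves the bulb {off} is mine): simulate 50/50 interventions and count outcomes to recover the conditional probabilities of the transition table.

```python
import random
from collections import Counter

def bulb_given_switch(switch):
    # Hypothetical mechanism: {up} lights the bulb 80% of the time (it flickers off 20%),
    # and {down} is assumed to always leave it {off}.
    if switch == "up":
        return "on" if random.random() < 0.8 else "off"
    return "off"

def randomized_trial(n_trials=100_000):
    # A/B test: apply do(switch=up) and do(switch=down) with the uniform
    # intervention distribution [1/2, 1/2], then count the outcomes.
    counts = Counter()
    for _ in range(n_trials):
        switch = random.choice(["up", "down"])
        counts[(switch, bulb_given_switch(switch))] += 1
    # The conditional probabilities p(bulb | do(switch)) are the transition table entries.
    for s in ("up", "down"):
        total = counts[(s, "on")] + counts[(s, "off")]
        print(s, {b: round(counts[(s, b)] / total, 2) for b in ("on", "off")})

randomized_trial()
# Approximate output: up {'on': 0.8, 'off': 0.2}   down {'on': 0.0, 'off': 1.0}
```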
 
Of course, we can also do A/B/C tests, and so on. What matters is randomizing over everything (creating an independent noise source) so that the result exposes the conditional probabilities between the states. The theory of causal emergence formalizes this as applying an intervention distribution: a probability distribution of do(x) operators. The intervention distribution that corresponds to an A/B test would be [1/2 1/2], and if A/B/C, then [1/3 1/3 1/3]. This is called a maximum entropy, or uniform, distribution.
How does information theory relate to causal structure?
Consider two variables, X and Y. We want to assess the causal influence X has over Y, X → Y. Assume, for now, there are no other effects on Y. If every change in the state of X is followed by a change in the state of Y, then the state of X contains a lot of causal information about Y. So if Y is very sensitive to the state of X, a metric of causal influence should be high. Note this is different from predictive information. You might be able to predict Y given X even if changes in X don’t lead to changes in Y (like how sales of swimwear in June could predict sales of air conditioners in July).

To be more formal about assessing X → Y, we inject noise into X and observe the effects on Y. Effective information, or EI, is the mutual information I(X;Y) between X and Y while intervening to set X to maximum entropy (inject noise). Note that this is the same as applying a uniform intervention distribution over the states of X.
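As a rough sketch of that definition (my own minimal implementation, not code from the papers), EI can be computed directly from a transition table by applying a uniform (maximum entropy) distribution of do(x) interventions and taking the mutual information between interventions and effects:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def ei(tpm):
    # Rows are interventions do(X = x); columns are effect states of Y; each row is p(Y | do(x)).
    # EI = I(X;Y) with the do(x) interventions applied uniformly (maximum entropy):
    # the entropy of the averaged effect distribution minus the average row entropy.
    tpm = np.asarray(tpm, dtype=float)
    effect = tpm.mean(axis=0)
    noise = np.mean([entropy_bits(row) for row in tpm])
    return entropy_bits(effect) - noise

# The flickering switch/bulb example from above.
flicker = [[0.8, 0.2],   # do(up):   on 80%, off 20%
           [0.0, 1.0]]   # do(down): off
print(round(ei(flicker), 2))   # 0.61 bits, matching the value quoted later in the post
```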
 
This is a measure of what can be called influence, or coupling, or constraint. Beyond capturing in an intuitive way how Y is causally coupled to the state of X, here are a few additional reasons that the metric is appropriate: i) setting X to maximum entropy screens off everything but the conditional probabilities, ii) it’s like doing an experimental randomized trial, or A/B test, without prior knowledge of the effects on Y of X’s states, iii) it doesn’t leave anything out, so if a great many states of X don’t impact Y, this will be reflected in the causal influence of X on Y, iv) it’s like injecting the maximum amount of experimental information into X, Hmax(X), in order to see how much of that information is reflected in Y, v) the metric can be derived from the cause/effect information of each state y → X  or x → Y, such as the expected number of Y/N questions it takes to identify the intervention on X at t given some y at t+1, vi) it isolates the information solely contained in the transition table (the actual causal structure), and vii) the metric is provably grounded in traditional notions of causal influence. 

Ultimately, the metric is using information theory to track the counterfactual dependence of Y on X. In traditional causal terminology this is putting a bit value on notions like how necessary and sufficient the state of X is for the state of Y. The theory generalizes these properties as determinism (the lack of noise) and degeneracy (the amount of convergence) over the state transitions, and proves that EI actually decomposes into these properties. EI is low if states of X only weakly determine the states of Y, or if many states of X determine the same states of Y (as those states are unnecessary from the perspective of Y). It is maximal only if all the causal relationships in X → Y are biconditional logical relationships (x if and only if y).
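One way to see the decomposition numerically (this follows my reading of the determinism and degeneracy coefficients used in the papers, so treat the exact normalization as an assumption):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float); p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def determinism(tpm):
    # 1 if every intervention leads to exactly one effect (no noise); 0 if effects are uniform noise.
    tpm = np.asarray(tpm, dtype=float)
    return 1 - np.mean([entropy_bits(row) for row in tpm]) / np.log2(tpm.shape[1])

def degeneracy(tpm):
    # 1 if all interventions converge on the same effect; 0 if effects are spread evenly over states.
    tpm = np.asarray(tpm, dtype=float)
    return 1 - entropy_bits(tpm.mean(axis=0)) / np.log2(tpm.shape[1])

flicker = [[0.8, 0.2], [0.0, 1.0]]
det, deg = determinism(flicker), degeneracy(flicker)
print(round(det, 3), round(deg, 3), round((det - deg) * np.log2(2), 3))
# The last value, (determinism - degeneracy) * log2(#states), reproduces EI (~0.61 bits).
```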
 
Another way to think about it is that EI captures how much difference each possible do(x) makes. In causal relationships where all states transition to the same state, no state makes a difference, so the EI is zero. If all interventions lead to completely random effects, the measure is also zero. The measure is maximal (equal to the logarithm of the number of states) if each intervention has a unique effect (i.e., interventions on X make the maximal difference to Y).
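Those limiting cases are easy to check with the ei() sketch from above:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float); p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def ei(tpm):   # same sketch as above: mutual information under uniform do(x) interventions
    tpm = np.asarray(tpm, dtype=float)
    return entropy_bits(tpm.mean(axis=0)) - np.mean([entropy_bits(r) for r in tpm])

all_to_one  = [[0, 1], [0, 1]]            # every state transitions to the same state
pure_noise  = [[0.5, 0.5], [0.5, 0.5]]    # every intervention has a completely random effect
permutation = [[1, 0], [0, 1]]            # every intervention has a unique effect

print(ei(all_to_one), ei(pure_noise), ei(permutation))   # 0.0, 0.0, 1.0 (= log2 of the number of states)
```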
 
How is the metric applied to systems?
Consider a simple switch/bulb system where the light doesn’t flicker. The relationship can be represented by the transition table:
[Transition table for the non-flickering switch/bulb system: do(switch = up) → bulb {on} with p = 1; do(switch = down) → bulb {off} with p = 1.]
In this system the causal structure is totally deterministic (there’s no noise). It’s also non-degenerate (all states transition to unique states). So the switch being {up} is both sufficient and necessary for the bulb being {on}. Correspondingly, the EI is 1 bit. However, for the previous case where the light flickered, the EI would be lower, at 0.61 bits.
 
Effective information even captures things traditional, non-information-theoretic measures of causation don’t capture. For instance, let’s say that we instead analyze a system of a light dial (with 256 states) and a light bulb with 256 states of luminance. Both the determinism (1) and degeneracy (0) are identical to the original binary switch/bulb system. But the causal structure overall contains a lot more information: each state leads to exactly one other state and there are hundreds of states. EI in this system is 8 bits, instead of 1 bit. 
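Using the same ei() sketch, both numbers are easy to reproduce (the dial/bulb system is modeled here as an identity map from 256 dial settings to 256 luminance states):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float); p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def ei(tpm):   # same sketch as above
    tpm = np.asarray(tpm, dtype=float)
    return entropy_bits(tpm.mean(axis=0)) - np.mean([entropy_bits(r) for r in tpm])

switch_bulb = np.eye(2)     # do(up) -> on, do(down) -> off, no flicker
dial_bulb   = np.eye(256)   # each of the 256 dial settings yields a unique luminance

print(ei(switch_bulb), ei(dial_bulb))   # 1.0 bit and 8.0 bits
```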
 
One can use the metric to look at individual causal relationships, but generally in the theory it’s used to assess the causal structure of a system as a whole. Often, this is something like a Markov process or a set of logic gates, which can look like a set of states and their transitions (left) or mechanisms with connections (right).

[Figure: an example system shown as a set of states and their transitions (left) and as mechanisms with connections (right).]
Instead of thinking about causal structure in terms of just X → Y, we can ask about the causal structure of the system as a whole, S → S. This is like thinking of the entire system as a channel that is transforming the past into the future. To reflect this, we can construct a transition probability matrix (TPM), which is a transition table over the entire set of system states. For example, for the Markov process above, with the states {00, 01, 10, 11}, the TPM (which is the causal structure of the system as a whole) is shown on the right.
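As a toy illustration of building a system-level TPM (a hypothetical two-element system of my own choosing, not necessarily the one pictured): suppose each binary element simply copies the other element's previous state. Enumerating the joint states {00, 01, 10, 11} gives the 4x4 TPM for S → S:

```python
import numpy as np
from itertools import product

def next_state(a, b):
    # Hypothetical mechanisms: at t+1, element A copies B's state at t, and B copies A's.
    return b, a

states = list(product([0, 1], repeat=2))            # 00, 01, 10, 11
tpm = np.zeros((4, 4))
for i, s in enumerate(states):
    tpm[i, states.index(next_state(*s))] = 1.0      # deterministic mechanisms -> a 0/1 TPM

print(tpm)
# Rows are interventions do(S = s); columns are the system state one step later.
# Applying the ei() sketch from above to this 4x4 TPM gives 2 bits (a deterministic,
# non-degenerate permutation over four states).
```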

Couldn’t you use some other measure or numerical value?
The goal of EI is to capture how much information is associated with the causal structure of the system. But doing so doesn’t prove it’s the one and only true measure. There could be a family of similar metrics, although my guess is that most break down into the EI, or close variants. Regardless, given its relationship to the mutual information as well as important causal properties, this isn’t some arbitrary metric that's swappable with something entirely different.

For instance, EI fits well with other fundamental information theory concepts, like the channel capacity. The channel capacity is the upper bound of information that can be reliably sent over a channel. In the context of the theory, the channel capacity is how much information someone can possibly send over the X → Y relationship by varying the state of X according to any intervention distribution. The channel capacity does end up having an important connection to causal structure. However, it’s not the same as a direct metric of X’s causal influence on Y. For instance, knowing the channel capacity doesn’t tell you about the determinism and degeneracy of the causal relationships of the states, nor does it tell you if interventions on X will produce reliable effects on Y, nor how sensitive the states of Y are to the states of X. With that said, one of the interesting conclusions of the research is that by looking at higher scales EI can approach or be equal to the channel capacity.
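As a sketch of that comparison (treating the flickering switch/bulb table from earlier as a simple binary channel, which is an assumption on my part), the channel capacity optimizes over input distributions while EI fixes the interventions to be uniform; for this table the two values are close but not identical:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float); p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_info(p_up, tpm):
    # I(X;Y) when the input/intervention distribution over the rows is [p_up, 1 - p_up].
    px = np.array([p_up, 1 - p_up])
    py = px @ tpm
    return entropy_bits(py) - float(np.dot(px, [entropy_bits(r) for r in tpm]))

flicker = np.array([[0.8, 0.2],    # do(up):   on 80%, off 20%
                    [0.0, 1.0]])   # do(down): off

ei_value = mutual_info(0.5, flicker)                                        # uniform interventions
capacity = max(mutual_info(p, flicker) for p in np.linspace(0, 1, 10001))   # brute-force search
print(round(ei_value, 3), round(capacity, 3))   # ~0.61 bits of EI vs. ~0.62 bits of capacity
```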
 
What does any of this have to do with emergence?
It’s about the emergence of higher-scale causal structure. To see if this is happening in a system, we do causal analysis across scales and measure the effective information at those different scales. What counts as a macroscale? Broadly, any description of a system that's not the most detailed microscale. Leaving some states exogenous, coarse-grainings (groupings of states/elements), black boxes (having states/elements be exogenous when they are downstream of interventions), setting some initial state or boundary condition: all of these are macroscales in the broad sense. Moving from the microscale to a macroscale might look something like this:

[Figure: a microscale system being mapped onto a coarser macroscale description.]
Interestingly, macroscales can have higher EI than the microscale. Basically, in some systems, doing a full series of A/B tests at the macroscale gives you more information than doing a corresponding full series of A/B tests at the microscale. More generally, you can think about it as how informative a TPM of the system is, and how that TPM gets more informative at higher scales.
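Here is a minimal toy example of this (my own construction, in the spirit of the systems in the papers): three micro states behave identically and noisily, so grouping them into a single macro state removes the noise and the macroscale EI exceeds the microscale EI. Because the grouped micro states have identical rows, the macro TPM can be built by just averaging rows and summing columns within each group (in general the theory is more careful about how macro interventions map onto micro states):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float); p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def ei(tpm):   # same sketch as above: mutual information under uniform do(x) interventions
    tpm = np.asarray(tpm, dtype=float)
    return entropy_bits(tpm.mean(axis=0)) - np.mean([entropy_bits(r) for r in tpm])

# Microscale: states s1, s2, s3 each transition to a random member of {s1, s2, s3}; s4 maps to itself.
micro = np.array([[1/3, 1/3, 1/3, 0],
                  [1/3, 1/3, 1/3, 0],
                  [1/3, 1/3, 1/3, 0],
                  [0,   0,   0,   1]])

# Macroscale: group {s1, s2, s3} into macro state A and keep s4 as macro state B.
groups = [[0, 1, 2], [3]]
macro = np.array([[micro[np.ix_(g, h)].mean(axis=0).sum() for h in groups] for g in groups])
# macro == [[1, 0], [0, 1]]: the noisy micro dynamics become a deterministic macro mechanism.

print(ei(micro), ei(macro))   # ~0.81 bits at the microscale vs. 1.0 bit at the macroscale
```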
 
Wait. How is that even possible?
There are multiple answers. In a general sense, causal structure is scale-variant. Microscale mechanisms (like NOR gates in a computer) can form a different macroscale mechanism (like a COPY gate). This is because the conditional probabilities of state-transitions change across scales. Consequently, the determinism can increase and the degeneracy can decrease at the higher scale (the causal relationships can be stronger).
 
Another answer is from information theory. Higher-scale relationships can have more information because they are performing error-correction. As Shannon demonstrated, you can increase how much information is transmitted across a channel by changing the input distribution. The analogy is that intervening on a system at different scales is like trying different inputs into a channel. From the perspective of the microscale, some higher-scale distributions will transmit more information. This is because doing a series of A/B tests to capture the effects of the macroscale states doesn’t correspond to doing a series of A/B tests to capture the effects of microscale states. A randomized trial at the macroscale of medical treatments to see their effect on tumors won’t correspond to an underlying set of microscale randomized trials, because many different microstates make up the macrostates.
 
Isn’t causal emergence just an issue of normalization, as Scott claimed in his blog post?
Not at all. To assess causal emergence there must be a comparison between two cases. This is between the fully detailed and most fine-scaled description of a system (the territory) and some reduced description (a map). To see clearly where Scott is wrong, consider the simplest example of causal emergence: the information gained by leaving a single state out of the system. Here, one case is when the state is indeed included in the system (the territory). In the formal language of the theory, this means we intervene on that particular state and observe the results, along with all the other states. This is done to measure the information produced by a distribution of interventions. Let's put aside why we want that information for now and just focus on the distribution, which might look something like [p(do(s1))=1/3, p(do(s2))=1/3, p(do(s3))=1/3]. In the second case we leave s3 out of the system, which definitionally means not intervening on or observing it, so p(do(s3))=0 instead of 1/3. Causal emergence would be when the information generated by the intervention distribution increases in this second case where p(do(s3))=0.

Scott Aaronson claimed that because the intervention distribution changes between the two cases, causal emergence is some kind of "normalization trick." According to Scott, to make the comparison "fairly" in this simple example would require dropping s3 from both the map and the territory, that is, that p(do(s3)) must be 0 in both cases, even in the case where s3 is included in the description. Note that Scott has no mathematical reason for mandating this, and the only math he did in his post was pointing out that if you use the same intervention distribution in both cases, you get the same bit value. This is a mere tautology, as you are just doing the same thing in both cases. Since the whole point of the research was to measure the information change between maps and territories, maintaining that the intervention distributions must be the same for both flat-out misses the point of the research and at the same time leads to nonsensical comparisons. This holds true for all manner of macroscales, is obvious in example systems (including the one Scott used in his post), and doesn't have anything to do with using the maximum entropy distribution to measure effective information.

So even before any mathematics comes into play, the criticism is merely a failure to understand the ways in which the intervention distribution can and should change depending on how the system is modeled or described. For instance, two variables grouped together into a macro variable by an experimenter will not be intervened on in the same way as before, leaving out a state implies not intervening on or observing that state, and so on. Scott never responded to this most basic of points. He also appeared not to understand the central thrust of the paper: that this change in the intervention distribution is analogous to how changes in an information channel's input distribution can lead to more information being transmitted over that channel. It would be an equivalent mistake to call Shannon's description of how to increase information over a channel "an issue of normalization" merely because the input distribution into a channel changes depending on the encoding of some signal or choices the user makes.

Why does causal emergence matter?
The theory does imply that universal reductionism is false when it comes to thinking about causation, and that sometimes higher scales really do have more causal influence (and associated information) than whatever underlies them. This is common sense in our day-to-day lives, but in the intellectual world it’s very controversial. More importantly, the theory provides a toolkit for judging cases of emergence or reduction with regard to causation. It also provides some insight into the structure of science itself, and why it’s hierarchical (biology above chemistry, chemistry above physics). One reason the theory offers is that scientists naturally gravitate to where the information about causal structure is greatest, that is, where their experiments are rewarded with the most information, and this won't always be the ultimate microscale. There are also specific applications of the theory, some of which are already underway: figuring out which scale of a biological or nervous system is most informative to experiment on, measuring the causal influence of one part of a system over another, or determining whether macrostates or microstates matter more to the behavior of a system.
 
As time goes on, I’m sure I’ll criticize various ideas on this blog. But I'll make sure that I always keep an open mind when I do and never rush to judgment.
30 Comments
julia buntaine
6/8/2017 06:42:32 am

great post!

Reply
George Musser
6/8/2017 10:32:37 am

Alas, I took away the opposite lesson: do rush, because people pay a lot more attention to that than to prudence. Your ideas have gotten a lot of attention through Scott's rush, and as frustrating as some of the responses have been, that's ultimately good for your ideas.

Reply
Erik Hoel link
6/8/2017 11:38:04 am

Thanks for the encouraging note George. Hopefully it brings new interest to the field along with new ideas. There's a bright future for research into causation and information, as long as it can survive the initial conservative backlash.

Reply
Will
6/8/2017 12:07:59 pm

Several things said in the blog post seem to me to be wrong, even for an informal introduction. Especially because you are quite critical of Scott in this post, I thought it was worth pointing out. I hope you don't find this rude.

> If you did an inappropriate A/B test that was 90% A and 10% B, that wouldn’t be a randomized trial, and it wouldn’t capture the causal relationships.

Unless the measure is 100% A, 0% B, this is still a randomized trial. Both A/B tests in business and randomized trials in science often have reason to work with uneven distributions in practice.

> Higher-scale relationships can have more information because they are performing error-correction. As Shannon demonstrated, you can increase how much information is transmitted across a channel by changing the input distribution.

This is not a correct explanation of how error-correcting codes work. Using a uniform distribution over an error-correcting code, instead of a uniform distribution over all bit strings, does not increase the mutual information, which is a measure of the expected information received. Instead, it ensures with high probability that a smaller amount of information is received exactly.

A better analogy for what the higher scale is doing is changing the input probability distribution to put a higher weight on signals where errors are less likely.

> A randomized trial at the macroscale of medical treatments to see their effect on tumors won’t correspond to an underlying set of microscale randomized trials, because many different microstates make up the macro states.

This fails to mention that the only way this correspondence can fail is when different numbers of micro states make up each macrostate.

> Only the maximum entropy distribution screens off everything but the conditional probability transitions between the states, i.e., the things that contain the information about causal relationships.

This is not true. The distribution that maximizes the mutual information also screens off everything but the conditional probability transitions, because the distribution is determined entirely by the conditional probability transitions.

> This is common sense in our day-to-day lives, but in the intellectual world it’s very controversial. More importantly, the theory provides a toolkit for judging cases of emergence or reduction. It also provides an explanation about the structure of science itself, and why it’s hierarchical (biology above chemistry, chemistry above physics). This might be because scientists naturally gravitate to where the information about causal structure is greatest, which is where they are rewarded in terms of information for their experiments the most, and this won't always be the ultimate microscale.

The theory of causal emergence does not fit our common-sense intuition about the appropriate scale to work on. I showed this with some examples at Scott Aaronson’s blog.

In particular, it does not explain why scientists sometimes work on a higher-level scale. It suggests that scientists should never work on a scale higher than fundamental physics if the different macro states of that scale have the same entropy. I have never seen chemists, upon noticing that the states of a chemical system that they are studying have the same entropy, give up on the problem and pass it on to physicists.


Reply
Erik Hoel link
6/8/2017 12:55:12 pm

Thanks so much for your comment Will. I appreciate your input. Scott and I have had some great interactions and I certainly don't mean to sound overly-critical (I actually edited to reflect this just in case). I just think Scott's criticism misses that we are trying to assess causation.

Some of your points are IMO issues of using non-technical language (like the A/B test thing - the point is what distribution reveals the transitions, not what people can in general do using all of statistics) or the error-correction (in the sentence you're referring to I'm not actually directly describing error-correcting codes but leading into the next point). In your point about the mutual information, I would rephrase it as the channel capacity being reflective only of the conditional probabilities. Definitely true, and that's associated with a particular scale when the EI of that scale approaches it.

To your point about scientists - the theory certainly doesn't suggest that scientists should not work on scales that aren't causally emergence. It just suggests a general hypothesis that scientists seek out scales that give them more information about causation, which are often causally emergent.

Thanks for your time - Erik

Reply
Will
6/8/2017 05:30:36 pm

The edited version seems much better.

I don't think Scott misses that you are trying to address causation. The issue is a scientific disagreement about how causation is addressed in the model.

With regards to the A/B testing, I still think that's not a very fair summary but the main difficulty is I don't find your argument for the use of the uniform distribution, in its long form or in these shorter summaries, very convincing.

With regards to the error correction, the issue is that there is no analogy between the way error-correcting codes work and the way causal emergence works.

With regards to the channel capacity, the issue is that you can calculate the channel capacity of any scale and it's much more natural from a lot of perspectives (e.g. many concrete examples) to study scales where the channel capacity is large than to study scales where the EI is large.

But I guess the primary issue has to do with scientists. To me there is no big mystery about why scientists work on different levels of organization. The primary issue is that, while we agree that "if everything is fixed about the lower scale of a system, everything about the higher scale must follow", we don't know how or why it follows - e.g. we cannot prove that water has a particular boiling point using a purely physics-based calculation, so the only way to find the boiling point is to perform an experiment at the macroscale. It is not too hard to come up with a few variants of this problem, and toy mathematical models of them, but causal emergence is clearly a model of a totally different phenomenon.

Erik Hoel
6/9/2017 06:22:38 am

To your comments: perhaps check out how the EI reflects the state-transition profiles in terms of the probability distribution of future/possible states in Hoel et al. (2013). The error-correction comes from the fact that if you’re trying to “send” a single state that is part of a family of multiple-realizable states, and if due to noise another state is received, then it’s as if that error is being corrected via the redundancy of multiple-realizability. And keep in mind that the EI will be the same as the channel capacity in many cases. In the papers I propose that the channel capacity is like the fixed causal structure and assessing the effects of individual states at different scales (in such a way that concords with identifying their sufficiency, necessity, etc, all calculated using their state-transition profiles) is like using different inputs. I think your hypothesis as to why scientists work at higher scales is the standard reasonable one, and, as I keep saying, it's not incompatible with what’s being suggested in the theory. I just don’t think it’s the end of the story. Thanks for stopping by - Erik

Larry Wasserman
6/8/2017 04:36:03 pm

Erik:
Why do you say:
"Only the maximum entropy distribution screens off everything but the conditional probability transitions between the states"
'Screening off' does indeed result from randomization, but that does not require a uniform distribution.
We use non-uniform distributions in randomized trials all the time.
--Larry

Reply
Erik Hoel
6/8/2017 05:10:35 pm

Hey Larry - I appreciate you taking the time to read through. I wouldn't claim that all experiments must, no matter what, be over a uniform distribution. You can do statistics over any distribution. I'm pointing out that the reason behind EI having the distribution it does (the maximum entropy) is so that it captures the information solely in the conditional probabilities by screening off the frequency of interventions. So it looks to capture causal structure via enforcing an independent noise source, which isn't the same as applying an arbitrary distribution.
Thanks again - Erik

Reply
James Cross link
6/9/2017 01:51:00 pm

I was the one who prodded Scott into commenting on your paper.

Is there a way of following your blog? A way to get an email if you do new posts?

Reply
Erik Hoel
6/9/2017 05:13:45 pm

So you're the one who started this James! haha
I just set up a tinyletter (at the top) that I'll update semi-regularly either with posts here, links to essays I publish on other sites, or when I publish a paper I think people might like.

Reply
Randall
6/9/2017 02:05:20 pm

Erik, Looking forward to seeing more of your work in the future. Rather than taking a simple example of the light bulb, I think it would be more fruitful to take a cellular automaton that is Turing-complete. Being Turing-complete, it could do some very interesting computation. It seems inevitable (to me!) that making sense of the results would require a macroscale view.

Reply
Erik Hoel
6/9/2017 05:23:05 pm

Thanks Randall, I appreciate that. The idea of demonstrating causal emergence with cellular automata (instead of Markov processes or networks of logic gates as shown in the papers) is definitely a good suggestion. You are right that it would provide a way to directly investigate its relationship to things like Turing completeness. Macroscales definitely perform different computations than microscales (this is pretty clear in the papers) but we haven't looked into the relationship to Turing completeness.

Reply
Larry
6/10/2017 07:58:22 am

Is Wolchover's headline
"A Theory of Reality as More Than the Sum of Its Parts"
accurate?

To open with a very toy example: a dot can never be a triangle. When we have three dots, a triangle emerges. A triangle is more than a point, more than the concept of a point, BUT (and this is where I disagree with the insinuation in the Quanta article) a triangle is not more than "a set of three or more points" -- a set of 4 points is a strict superset (information-wise) of a single triangle.

My lay understanding is that your theory is about (1) a mathematical model for optimizing the scale/resolution at which most of the remaining details of a model matter and the noise has been smoothed out, and (2) this can help us say which resolution is a practically tractable level of detail to analyze a specific scientific problem, and (3) a way to say if a system is large/complex enough to be able to contain enough information to engage in some behavior (a sandwich cannot love Romeo, Juliet can).

This seems a valuable way of mathematizing the intuition behind questions like "when is a question a matter of physics vs chemistry vs psychology vs chess vs football".

In your camera analogy, it's more like "JPEG compression" than "focus" -- finding which part of the data matters. Or it's "focus" in the sense that when I have a million rays of light flying at my sensor, I get more information about objects 100m away when I use a 200mm lens instead of a 50mm lens -- if I could collect all the light rays separately, I could see at all distances (like the multifocus camera someone invented), but in PRACTICE it's impossible to process all the data without sampling somehow. So your theory provides guidance on how to choose a sampling algorithm.

it's like you've designed a general compression/classification algorithm strategy for science, identifying the essence of what makes something a distinct field of study from another.

But it doesn't seem to prove that it's *impossible in principle, with unlimited resources* to physically model Romeo's love for Juliet.
Your theory can tell us the best way to quantify "what do we really mean by 'Romeo', 'Juliet', and 'love'" -- it's a theory about how to label and understand reality, how to *interpret* the Universe, not a theory about where a certain part of reality is or isn't.


And it doesn't prove things like the Penrosian claim that "it's impossible for a computer to simulate a human because humans have something that cannot be created from fundamental parts."

So, the Quanta article oversells the unsupported metaphysics (annoying people like Scott (and me)) in order to "sex up" an otherwise hard-to-grasp, sophisticated mathematical contribution you are making.

Reply
Erik Hoel
6/10/2017 09:11:49 am

Hey Larry - thanks for your thoughts. To your first question: the article also covered IIT, so it’s hard to precisely separate all the claims, and I certainly wouldn't want to start judging Natalie's work (she knows more about science communication than I do). More to the point, having written articles for sites before, I can say that a shocking thing is that often a headline writer (not the author) chooses the headline, based on SEO. But I wouldn’t say Natalie misrepresented the theory; rather, she focused on the strongest claims of the theory.

Obviously, I would argue that the theory should be taken as a whole. But it’s also not completely unreasonable to consider it in terms of weak claims and strong claims. In weak claims it’s broadly what you're saying. In strong claims it’s also about causal influence. The distinction between an interpretation of the universe and what the universe is actually doing hangs, in my opinion, on this claim. I think there’s good reasons to believe the stronger claim about causal influence, for the reasons I say above and in the papers. However, one could disagree with that and still get something pretty interesting out of the theory.

Maybe the biggest thing I’ve struggled to get across is that I’m not talking about the general notion of information. Rather, that just by changing what you consider a state in a system (such as by coarse-graining) you can sometimes increase the information value that’s associated with the conditional probabilities of state-transitions. So it may very well be possible (in principle, not in practice) to model Romeo’s love for Juliet at the microscale. But this will contain less information about causal structure than if someone chose a higher scale. That is, there's no conflict between higher scales being derivable (don’t contain any more information in the most general sense of the term) and them also having more information about causation. Why? Because the total volume of the subset of information that is about the causal relationships can outstrip the lower scale, even if the total amount of all information is decreasing.

Reply
Natesh Ganesh
6/11/2017 08:57:40 pm

Hi Erik,
A good introduction to the idea of causal emergence. It cleared up a few things for me, but left me pondering about a few other technical details. I guess the following is my major question-
Let's take the light switch/bulb example. The TPM in the post, is that the transition matrix for the light bulb alone or for the light bulb/light switch joint system? I am wondering what the (i,j)-th element of your matrices in general, and of this example matrix, means. Is it the case that we have the joint [light switch, bulb] going from an [off, off] state to an [on, on] state and vice versa? And is it like a traditional Markov transition matrix where we can compute next-state probabilities = current state * transition matrix?
I can see how EI captures structural properties of a transition matrix, but without being exactly sure what the elements in the matrix actually mean, I am not able to make the causality connection completely yet. I am looking forward to your answers. Thanks.
Natesh

Reply
Erik Hoel
6/12/2017 07:53:41 am

Thanks for reading Natesh. The example is of a causal relationship (which can be represented as a TPM) but you're right that the example system isn't a completely described Markov process in the sense you're talking about, because the rows and columns are over different states. One could make the same TPMs with simple Markov processes of two states, A & B, in which case for each entry [A->A, A->B; B->A, B->B] the (i,j) is the probability of that state-transition. These are the types of processes used in the papers, and we want to know how much information is gained by intervening upon them. What EI then captures is how different the full set of (state-transition distribution|state) are by comparing them all. This is also how much information some set of randomized interventions produce, because it is only a high bit value when all the states have different and high-probability effects, meaning that for each intervention, such as do(A) and do(B), you get an informative effect.

Reply
Natesh Ganesh
6/13/2017 09:46:13 am

Thanks, that cleared up a few things. So are all transition matrices in your papers structured the same way? That is, are they not the traditional Markov PTM but instead have rows and columns relating to different systems? I ask since I have seen your coarse-graining examples and wonder whether, when you go from a 4x4 to a 2x2 matrix, you are in those cases dealing with the traditional Markov PTM.

Erik Hoel
6/13/2017 12:31:41 pm

Yup, the ones in the paper are traditional PTMs/TPMs with rows and columns (i,j) that dictate the probability of a state transition within a particular system, not two different systems.

Natesh Ganesh
6/19/2017 11:43:19 am

Sorry for the gap in response, but paper deadlines keep one busy.

See, this is where I am confused again. Are you trying to establish the causal relationship between a system state (light bulb) and some other system (light switch) that might produce a change in the system (light bulb) state? Or is EI trying to capture the causal relationship between the state of one system, say the light bulb, at time t vs its state at time (t+1)? Or some combination of both? There are these transition matrices for which you calculate EI, and you have state diagrams, but it is hard to tell what exactly is happening without a reference to the driving signal. I feel like it isn't just important to know that a state 'a' can transition to state 'b' with a certain probability; we also need to know for which input signal this happens (something I do not see marked on these transition diagrams), which goes back to my earlier question. Your EI value of 1 bit for the simple light bulb/switch case seems to indicate you are trying to capture the relationship between the bulb system and the switch signal that is turning it off and on. Am I missing something here?

And there is also the issue of making a one-shot calculation at some time instant vs capturing the causal relationship over a time period. Again for the simple case of a light bulb-switch, a one-shot calculation of the causal relationship between the two seems to be captured. Is that the same for Markov chains given by those transition matrices? The EI value is calculated for a single time instant with uniform state distribution and given transition matrix?

Erik Hoel
6/20/2017 10:00:06 am

Hey Natesh - I've been adding some more to the primer, specifically concerning information theory for those interested. It should now give you a clearer idea of what's going on (the papers also address these questions).

EI is a state-independent measure of causal structure, so it's not directly dependent on the current state or the current timestep, although it can be broken down into the effect information and the cause information, which are about the current state and the current timestep. Considered completely by itself and measured over all system states, EI captures the causal structure invariant of the input signal - that's why it uses the maximum entropy. Otherwise the causal relationship between a light switch and bulb would shift wildly in strength from day to day as people left the bulb on or off. The point is to isolate the information in the causal structure that's strictly conditional (if UP then ON, if DOWN then OFF).

P. Applebee
6/18/2017 12:40:27 pm

Thanks for writing this, Erik. I think some of the issues stem from what appears to be "mixed messaging" from you and the people who have written about you. For example, everything in your first two sections here sounds totally reasonable and very interesting, yet on your Twitter page there is a tweet admonishing the idea that "my atoms made me do it". Yet here on your blog, it seems that you actually *aren't* making that argument: you're saying something much weaker, just that talking about atoms isn't the right level of description to explain "why you did something". In fact, wouldn't you admit that if physical determinism is true, your behavior *is* entirely determined by the behavior of your atoms?

Reply
Erik Hoel
6/18/2017 02:41:35 pm

Thanks for reading. Anything written for a popular audience about any subject won't be complete. In general, the theory is sketched pretty clearly in the papers, the primer, and essays. Given the theory, I think there's a good chance that higher scales really do cause behavior, in which case, yes, it would be incorrect to claim that "my atoms made me do it." To your question: causal emergence can occur in deterministic systems (and many people say we're not living in a deterministic universe anyways).

Reply
Stefan Schindler link
6/21/2017 04:45:45 am

I have not read all your work, nor all the comments, but "causal emergence" caught my attention and seems to me right on target, not least because it overlaps with (my modest understanding of) the "process philosophy" of Alfred North Whitehead (and, tangentially, a core notion in Buddhism). Which is to say: "causal emergence" plays a vital role in "creative evolution" as presented in Whitehead's (scientifically informed) metaphysics (and various Buddhist sutras). I'm not the person to elaborate on these two suggestions. I simply present them to you as provocative edifications -- potentially useful to you in your expanding research and interpretation, which might then also include Henri Bergson.

Reply
Erik Hoel
6/21/2017 01:47:44 pm

Thanks Stefan. I'll confess I'm not familiar with process philosophy, beyond being aware of its existence and associating it with Whitehead. I haven't read any Bergson either, although I know of his infamous debate with Einstein. Both are interesting suggestions. One thing is that I am trying to avoid too much metaphysics - people get up in arms about that sort of thing very easily (and I've noticed it's more when other people describe my work than when they read my work). I think causation will be like information: something that started out seeming very philosophical in its questions and debates but then eventually got (mostly) skimmed away from philosophy into science and math. That's kind of what I'm hoping to contribute to here, but with causation and issues of reduction/emergence.

Reply
T. Anton
8/7/2017 04:03:54 am

Nice work, however, there are two avenues of attack that immediately occur to me:

1. Would it be possible to construct pathological coarse-grainings of a system that don't remotely reflect its general behaviour at any scale but still have high EI? If so, then using EI as a measure of relevant information would definitely fail in such cases.

2. Is there always a unique coarse-graining of a particular system with maximum EI (at least up to the internal symmetries of the system)? If not, it wouldn't be a fatal flaw, but if very different coarse-grainings could achieve maximal or near-maximal EI, that would indicate a problem to me, as it would appear to say that a system would have multiple natural scales of description.

Reply
Erik Hoel
8/7/2017 09:06:20 am

Hey T. Anton - thanks! Great questions.

1) EI reflects the coupling of the variables that make up a system (the constraint they put on one another). This is the causal structure, and it, in general, does reflect the behavior of a system. So one cannot construct pathological causal structures from finding maxima of EI, since EI is based on the definitional properties of causal structure. But your question was a little bit more subtle by involving behavior. One might, in a macabre gesture, use a brain as a paperweight. This doesn’t obviate the internal causal structure of the brain, even if that causal structure reflects nothing about its current behavior as a paperweight. So there are scenarios and contexts where the causal structure of a system isn’t reflected in its behavior / function in terms of the input/output. The maximum EI might be pathological with respect to some arbitrary input/output function that you can assign to the system (since there doesn’t seem to be any constraint on what input/output function one can assign). But the maximum EI will always be non-pathological as a description of the causal structure of that system (i.e., the set of elements and their relationships). More generally, the principles of causal emergence (emergence is like coding and is a form of noise reduction over causal relationships) should apply in general even if there were something wrong with EI that we haven’t figured out yet.

2) There can definitely be multiple macro-coarse grains with the same EI in the current version. I too think uniqueness is not much of an issue, as the number of potential bad descriptions which have been ruled out is massive compared to the very small number of constrained interpretations (I like to say they are “interpretations constrained by constraint”). One thing is that in the real world there will probably always be symmetry breaking such that one group wins. But your question about how different these groupings are is spot on. I agree in that if they were wildly different in type/form this would be a problem, as it would indicate an arbitrariness to the process of finding a maximum of EI. However, in general they are very similar (over the same elements) or have some more abstract similarity (like being a symmetrical grouping). Another way of putting it is that causal emergence is about noise reduction (this includes degeneracy), and the noise has to come from somewhere (innate in the system, instantiated in some mechanisms, from outside, if you’re just thinking practically from your measuring device, etc). All the groupings that win minimize the same noise, so always have some sensible relationship to one another. I’ve played with a lot of toy systems and never had groupings that felt *contradictory* even though they were different - they always have similarities or symmetries.

Reply
Len link
11/5/2018 02:49:44 pm

This is a copy (because I am lazy :) of the comment I just left on Scott's blog entry.

I discovered this ongoing (or is it?) debate long after reading the Quanta article. But at the time, the mention in the article of Scott Aaronson's skepticism was sufficient to immediately make the connection to his already well-known criticism of Tononi's Integrated Information Theory. It figures, when you learn that Tononi was Hoel's PhD advisor, that this is merely a continuation of that old debate, even without looking at the details. To my mind the entire conversation represents different philosophical stances traditionally assumed by physicists and biologists (putting Scott squarely into the physicist camp). Someone already mentioned in an earlier comment Daniel Dennett's brilliant analysis of the controversy, which he did way back in 1991 and which is still, to my mind, the best available guide to these occasionally recurring debates. Once you read Dennett, it becomes clear that there can be no winner in any such debate. Right vs. wrong math is only a matter of the initial assumptions, which are indeed radically different in Hoel's work. My personal preference has always been for physicalist assumptions (which I don't identify with hard-core reductionism, but regard merely as a method of work).

But what really made me make this comment long after the conversation seemingly ended (on this occasion) is another article I stumbled upon, written by a true expert in causal inference (Hoel even made a reference to his work in the Entropy paper). So here is the article, which in my view supersedes all the arguments above by showing, with fascinating historical details, how and when the debate actually started:
https://aeon.co/essays/could-we-explain-the-world-without-cause-and-effect

Reply
ThosVarley link
3/5/2019 12:32:41 pm

I realize that this is an old post, but I'm curious about one thing:

I understand the idea that certain scales may be more informative than others (and that the most informative scale may not be the "foundational" one), but I'm struggling to connect "informativeness" to "causation." If you've got a simple toy model like a boolean network or a lightswitch, there's a clear association between "flipping the switch -> light comes on", which makes intuitive sense, but won't that break down for non-trivial systems (where hidden causal states might be a concern)?

Reply
Gabriel link
4/17/2019 05:31:19 pm

Great blog Erik

Reply


