Regret Minimization in Games with Incomplete Information (CFR)

The paper Regret Minimization in Games with Incomplete Information introduces counterfactual regret and shows how minimizing counterfactual regret through self-play can be used to reach a Nash equilibrium. The algorithm is called Counterfactual Regret Minimization (CFR).

The paper Monte Carlo Sampling for Regret Minimization in Extensive Games introduces Monte Carlo Counterfactual Regret Minimization (MCCFR), where we sample from the game tree and estimate the regrets.

We tried to keep our Python implementation easy to understand, like a tutorial. We run it on a very simple imperfect-information game called Kuhn poker.


Twitter thread

Introduction

We implement Monte Carlo Counterfactual Regret Minimization (MCCFR) with chance sampling (CS). On each iteration it explores part of the game tree, trying all player actions but sampling chance events. Chance events are things like dealing cards; they are sampled once per iteration. It then calculates, for each action, the regret of following the current strategy instead of taking that action, and updates the strategy based on these regrets for the next iteration, using regret matching. Finally, it computes the average of the strategies over all iterations, which is very close to a Nash equilibrium if we have run enough iterations.

We will first introduce the mathematical notation and theory.

Player

A player is denoted by $i \in N$, where $N$ is the set of players.

History

History $h \in H$ is a sequence of actions including chance events, and $H$ is the set of all histories.

$Z \subseteq H$ is the set of terminal histories (game over).

Action

Action $a$, $A(h) = \{a: (h, a) \in H\}$ where $h \in H$ is a non-terminal history.

Information Set $I_i$

Information set $I_i \in \mathcal{I}_i$ for player $i$ is similar to a history $h \in H$ but only contains the actions visible to player $i$. That is, the history $h$ will contain actions/events such as cards dealt to the opposing player while $I_i$ will not have them.

$\mathcal{I}_i$ is known as the information partition of player $i$.

An information set $I$ is the set of all histories that look the same in the eye of the player; we write $h \in I$ for the histories that belong to a given information set.

Strategy

The strategy of player $i$, $\sigma_i \in \Sigma_i$, is a distribution over actions $A(I_i)$, where $\Sigma_i$ is the set of all strategies for player $i$. The strategy on the $t$-th iteration is denoted by $\sigma^t_i$.

A strategy is defined as a probability distribution over the actions $A(I)$ for a given information set $I$: $\sigma_i(I)(a)$ is the probability of taking action $a$ at $I$, with $\sum_{a \in A(I)} \sigma_i(I)(a) = 1$.

$\sigma$ is the strategy profile, which consists of the strategies of all players: $\sigma_1, \sigma_2, \ldots$

$\sigma_{-i}$ is the set of strategies of all players except player $i$'s strategy $\sigma_i$.

Probability of History

$\pi^\sigma(h)$ is the probability of reaching the history $h$ with strategy profile $\sigma$. $\pi^\sigma_{-i}(h)$ is the probability of reaching $h$ without player $i$'s contribution; i.e. player $i$ took the actions to follow $h$ with a probability of $1$.

$\pi^\sigma_{i}(h)$ is the probability of reaching $h$ with only player $i$'s contribution. That is,

$$\pi^\sigma(h) = \pi^\sigma_{i}(h) \, \pi^\sigma_{-i}(h)$$

The probability of reaching an information set $I$ is

$$\pi^\sigma(I) = \sum_{h \in I} \pi^\sigma(h)$$
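
For example, in Kuhn poker, suppose chance deals a King to player $1$ and a Queen to player $2$ (an event with probability $\frac{1}{6}$), and player $1$ then bets with probability, say, $0.3$ under $\sigma$. For that history $h$ we have $\pi^\sigma(h) = \frac{1}{6} \cdot 0.3$, while $\pi^\sigma_1(h) = 0.3$ and $\pi^\sigma_{-1}(h) = \frac{1}{6}$ (chance's contribution is counted in $\pi^\sigma_{-1}$). The numbers here are purely illustrative.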

Utility (Pay off)

The terminal utility $u_i(h)$ is the utility (or payoff) of player $i$ for a terminal history $h$, where $h \in Z$.

$u_i(\sigma)$ is the expected utility (payoff) for player $i$ with strategy profile $\sigma$,

$$u_i(\sigma) = \sum_{h \in Z} u_i(h) \, \pi^\sigma(h)$$

Nash Equilibrium

Nash equilibrium is a state where none of the players can increase their expected utility (or payoff) by changing their strategy alone.

For two players, a Nash equilibrium is a strategy profile where

$$u_1(\sigma) \ge \max_{\sigma'_1 \in \Sigma_1} u_1(\sigma'_1, \sigma_2)
\quad \text{and} \quad
u_2(\sigma) \ge \max_{\sigma'_2 \in \Sigma_2} u_2(\sigma_1, \sigma'_2)$$

An $\epsilon$-Nash equilibrium is a strategy profile where

$$u_1(\sigma) + \epsilon \ge \max_{\sigma'_1 \in \Sigma_1} u_1(\sigma'_1, \sigma_2)
\quad \text{and} \quad
u_2(\sigma) + \epsilon \ge \max_{\sigma'_2 \in \Sigma_2} u_2(\sigma_1, \sigma'_2)$$

Regret Minimization

Regret is the utility (or payoff) that the player didn't get because she didn't follow the optimal strategy or take the best action.

Average overall regret for player $i$ is the average regret of not following the optimal strategy in all $T$ rounds of iterations,

$$R^T_i = \frac{1}{T} \max_{\sigma^*_i \in \Sigma_i} \sum_{t=1}^T \Big( u_i(\sigma^*_i, \sigma^t_{-i}) - u_i(\sigma^t) \Big)$$

where $\sigma^t$ is the strategy profile of all players in iteration $t$, and $(\sigma^*_i, \sigma^t_{-i})$ is the strategy profile $\sigma^t$ with player $i$'s strategy replaced with $\sigma^*_i$. That is, $R^T_i$ is the mean regret of not playing with the optimal strategy.

The average strategy is the average of strategies followed in each round, for all $I \in \mathcal{I}, a \in A(I)$,

$$\color{cyan}{\bar{\sigma}^T_i(I)(a)} = \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I) \, \sigma^t_i(I)(a)}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$

If $R^T_i < \epsilon$ for all players, then the average strategy profile $\bar{\sigma}^T$ is a $2\epsilon$-Nash equilibrium. We can see this as follows.

Since $u_1 = -u_2$ because it's a zero-sum game, we can add $R^T_1$ and $R^T_2$, and the $u_i(\sigma^t)$ terms cancel out:

$$2\epsilon \ge R^T_1 + R^T_2 = \frac{1}{T} \max_{\sigma^*_1 \in \Sigma_1} \sum_{t=1}^T u_1(\sigma^*_1, \sigma^t_2) + \frac{1}{T} \max_{\sigma^*_2 \in \Sigma_2} \sum_{t=1}^T u_2(\sigma^t_1, \sigma^*_2)$$

The average of utilities over a set of strategies is equal to the utility of the average strategy,

$$\frac{1}{T} \sum_{t=1}^T u_1(\sigma^*_1, \sigma^t_2) = u_1(\sigma^*_1, \bar{\sigma}^T_2)$$

Therefore,

$$2\epsilon \ge \max_{\sigma^*_1 \in \Sigma_1} u_1(\sigma^*_1, \bar{\sigma}^T_2) + \max_{\sigma^*_2 \in \Sigma_2} u_2(\bar{\sigma}^T_1, \sigma^*_2)$$

From the definition of $\max$,

$$\max_{\sigma^*_2 \in \Sigma_2} u_2(\bar{\sigma}^T_1, \sigma^*_2) \ge u_2(\bar{\sigma}^T_1, \bar{\sigma}^T_2) = -u_1(\bar{\sigma}^T_1, \bar{\sigma}^T_2)$$

Then,

$$u_1(\bar{\sigma}^T_1, \bar{\sigma}^T_2) + 2\epsilon \ge \max_{\sigma^*_1 \in \Sigma_1} u_1(\sigma^*_1, \bar{\sigma}^T_2)$$

This is the condition for a $2\epsilon$-Nash equilibrium (the argument for player 2 is symmetric). You can similarly prove it for games with more than 2 players.

So we need to minimize $R^T_i$ to get close to a Nash equilibrium.

Counterfactual regret

Counterfactual value $\color{pink}{v_i(\sigma, I)}$ is the expected utility for player $i$ if player $i$ tried to reach $I$ (took the actions leading to $I$ with a probability of $1$),

$$\color{pink}{v_i(\sigma, I)} = \sum_{z \in Z_I} \pi^\sigma_{-i}(z[I]) \, \pi^\sigma(z[I], z) \, u_i(z)$$

where $Z_I$ is the set of terminal histories reachable from $I$, and $z[I]$ is the prefix of $z$ up to $I$. $\pi^\sigma(z[I], z)$ is the probability of reaching $z$ from $z[I]$.

Immediate counterfactual regret is,

$$\color{coral}{R^T_{i,imm}(I)} = \frac{1}{T} \max_{a \in A(I)} \sum_{t=1}^T \color{coral}{r^t_i(I, a)}$$

where

$$\color{coral}{r^t_i(I, a)} = \color{pink}{v_i(\sigma^t |_{I \rightarrow a}, I)} - \color{pink}{v_i(\sigma^t, I)}$$

where $\sigma |_{I \rightarrow a}$ is the strategy profile $\sigma$ with the modification of always taking action $a$ at information set $I$.

The paper proves that (Theorem 3),

$$R^T_i \le \sum_{I \in \mathcal{I}_i} R^{T,+}_{i,imm}(I)$$

where

$$R^{T,+}_{i,imm}(I) = \max \big( R^T_{i,imm}(I), 0 \big)$$

Regret Matching

The strategy is calculated using regret matching.

The regret for each information set and action pair $\color{orange}{R^T_i(I, a)}$ is maintained,

$$\color{orange}{R^T_i(I, a)} = \frac{1}{T} \sum_{t=1}^T \Big( \color{pink}{v_i(\sigma^t |_{I \rightarrow a}, I)} - \color{pink}{v_i(\sigma^t, I)} \Big)$$

and the strategy is calculated with regret matching,

$$\color{lightgreen}{\sigma_i^{T+1}(I)(a)} =
\begin{cases}
\frac{\color{orange}{R^{T,+}_i(I, a)}}{\sum_{a' \in A(I)} \color{orange}{R^{T,+}_i(I, a')}}, & \text{if } \sum_{a' \in A(I)} \color{orange}{R^{T,+}_i(I, a')} > 0 \\
\frac{1}{\lvert A(I) \rvert}, & \text{otherwise}
\end{cases}$$

where $\color{orange}{R^{T,+}_i(I, a)} = \max \Big(\color{orange}{R^T_i(I, a)}, 0 \Big)$

The paper Regret Minimization in Games with Incomplete Information proves that if the strategy is selected according to the equation above, $R^T_i$ shrinks in proportion to $\frac{1}{\sqrt T}$, and therefore the average strategy reaches an $\epsilon$-Nash equilibrium.
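
To get a feel for how regret matching drives the average strategy towards equilibrium, here is a small self-contained sketch. It is not part of the papers or of the implementation below; for brevity it uses a normal-form game (rock-paper-scissors, which has a single decision point per player) instead of an extensive game.

```python
# Payoff matrix for player 1 (player 2's payoff is the negation, i.e. zero-sum);
# actions are 0: rock, 1: paper, 2: scissors
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]
N_ACTIONS = 3


def regret_matching(regret):
    # sigma(a) proportional to max(R(a), 0); uniform if there is no positive regret
    positive = [max(r, 0.) for r in regret]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1. / N_ACTIONS] * N_ACTIONS


regret = [[0.] * N_ACTIONS, [0.] * N_ACTIONS]
cumulative = [[0.] * N_ACTIONS, [0.] * N_ACTIONS]

for t in range(10_000):
    sigma1 = regret_matching(regret[0])
    sigma2 = regret_matching(regret[1])
    for a in range(N_ACTIONS):
        cumulative[0][a] += sigma1[a]
        cumulative[1][a] += sigma2[a]

    # Expected utility of each pure action against the opponent's current strategy
    u1 = [sum(PAYOFF[a][b] * sigma2[b] for b in range(N_ACTIONS)) for a in range(N_ACTIONS)]
    u2 = [sum(-PAYOFF[a][b] * sigma1[a] for a in range(N_ACTIONS)) for b in range(N_ACTIONS)]
    # Expected utility of the current mixed strategies
    v1 = sum(sigma1[a] * u1[a] for a in range(N_ACTIONS))
    v2 = sum(sigma2[b] * u2[b] for b in range(N_ACTIONS))
    # Accumulate the regret of not having played each pure action
    for a in range(N_ACTIONS):
        regret[0][a] += u1[a] - v1
        regret[1][a] += u2[a] - v2

# The average strategies approach the uniform Nash equilibrium (1/3, 1/3, 1/3)
print([c / sum(cumulative[0]) for c in cumulative[0]])
print([c / sum(cumulative[1]) for c in cumulative[1]])
```

Running it prints average strategies close to $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ for both players, the unique Nash equilibrium of rock-paper-scissors, even though the current strategies keep oscillating.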

Monte Carlo CFR (MCCFR)

Computing $\color{coral}{r^t_i(I, a)}$ requires expanding the full game tree on each iteration.

The paper Monte Carlo Sampling for Regret Minimization in Extensive Games shows we can sample from the game tree and estimate the regrets.

$\mathcal{Q} = \{Q_1, \ldots, Q_r\}$ is a set of subsets of $Z$ ($Q_j \subseteq Z$), where we look at only a single block $Q_j$ in an iteration. The union of all subsets spans $Z$ ($Q_1 \cup \ldots \cup Q_r = Z$). $q_j$ is the probability of picking block $Q_j$.

$q(z)$ is the probability of picking $z$ in the current iteration; i.e. $q(z) = \sum_{j: z \in Q_j} q_j$, the sum of $q_j$ over all blocks $Q_j$ that contain $z$.

Then we get the sampled counterfactual value for block $j$,

$$\color{pink}{\tilde{v}_i(\sigma, I | j)} = \sum_{z \in Q_j \cap Z_I} \frac{1}{q(z)} \pi^\sigma_{-i}(z[I]) \, \pi^\sigma(z[I], z) \, u_i(z)$$

The paper shows that

$$\mathbb{E}_{j \sim q_j} \Big[ \color{pink}{\tilde{v}_i(\sigma, I | j)} \Big] = \color{pink}{v_i(\sigma, I)}$$

with a simple proof.

Therefore we can sample a part of the game tree and calculate the regrets. We calculate an estimate of the regrets,

$$\color{coral}{\tilde{r}^t_i(I, a)} = \color{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} - \color{pink}{\tilde{v}_i(\sigma^t, I)}$$

And use that to update $\color{orange}{R^T_i(I, a)}$ and calculate the strategy $\color{lightgreen}{\sigma_i^{T+1}(I)(a)}$ on each iteration. Finally, we calculate the overall average strategy $\color{cyan}{\bar{\sigma}^T_i(I)(a)}$.
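
For example, with chance sampling in Kuhn poker, each block $Q_j$ contains exactly the terminal histories consistent with one particular deal of cards. There are $3 \times 2 = 6$ possible deals, so $q(z) = \frac{1}{6}$ for every terminal history $z$; this constant factor is what lets the implementation below ignore $q(z)$ altogether.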

Here is a Kuhn poker implementation that you can use to try out this CFR implementation.

Let’s dive into the code!

from typing import NewType, Dict, List, Callable, cast

from labml import monit, tracker, logger, experiment
from labml.configs import BaseConfigs, option

A player $i \in N$ where $N$ is the set of players

Player = NewType('Player', int)

Action $a$, $A(h) = \{a: (h, a) \in H\}$ where $h \in H$ is a non-terminal history

Action = NewType('Action', str)

History

History $h \in H$ is a sequence of actions including chance events, and $H$ is the set of all histories.

This class should be extended with game specific logic.

class History:

Whether it’s a terminal history; i.e. game over. $h \in Z$

    def is_terminal(self):
        raise NotImplementedError()

Utility of player $i$ for a terminal history. $u_i(h)$ where $h \in Z$

    def terminal_utility(self, i: Player) -> float:
        raise NotImplementedError()

Get the current player, denoted by $P(h)$, where $P$ is known as the player function.

If $P(h) = c$ the current event is a chance event; something like dealing cards, or opening community cards in poker.

    def player(self) -> Player:
        raise NotImplementedError()

Whether the next step is a chance step; something like dealing a new card. $P(h) = c$

    def is_chance(self) -> bool:
        raise NotImplementedError()

Sample a chance outcome when $P(h) = c$.

    def sample_chance(self) -> Action:
        raise NotImplementedError()

Add an action to the history.

    def __add__(self, action: Action):
        raise NotImplementedError()

Get the information set key for the current player

    def info_set_key(self) -> str:
        raise NotImplementedError()

Create a new information set for the current player

    def new_info_set(self) -> 'InfoSet':
        raise NotImplementedError()

Human readable representation

    def __repr__(self):
        raise NotImplementedError()

Information Set $I_i$

class InfoSet:

Unique key identifying the information set

    key: str

$\sigma_i$, the strategy of player $i$

    strategy: Dict[Action, float]

Total regret of not taking each action $a \in A(I_i)$.

We maintain $T \color{orange}{R^T_i(I, a)}$ instead of $\color{orange}{R^T_i(I, a)}$ since the $\frac{1}{T}$ term cancels out anyway when computing the strategy $\color{lightgreen}{\sigma_i^{T+1}(I)(a)}$

    regret: Dict[Action, float]

We maintain the cumulative strategy

$$\sum_{t=1}^T \pi_i^{\sigma^t}(I) \, \sigma^t(I)(a)$$

to compute the overall average strategy $\color{cyan}{\bar{\sigma}^T_i(I)(a)}$

    cumulative_strategy: Dict[Action, float]

Initialize

    def __init__(self, key: str):
        self.key = key
        self.regret = {a: 0 for a in self.actions()}
        self.cumulative_strategy = {a: 0 for a in self.actions()}
        self.calculate_strategy()

Actions $A(I_i)$

    def actions(self) -> List[Action]:
        raise NotImplementedError()

Load information set from a saved dictionary

    @staticmethod
    def from_dict(data: Dict[str, any]) -> 'InfoSet':
        raise NotImplementedError()

Save the information set to a dictionary

    def to_dict(self):
        return {
            'key': self.key,
            'regret': self.regret,
            'average_strategy': self.cumulative_strategy,
        }

Load data from a saved dictionary

    def load_dict(self, data: Dict[str, any]):
        self.regret = data['regret']
        self.cumulative_strategy = data['average_strategy']
        self.calculate_strategy()

Calculate strategy

Calculate the current strategy using regret matching,

$$\color{lightgreen}{\sigma_i^{T+1}(I)(a)} =
\begin{cases}
\frac{\color{orange}{R^{T,+}_i(I, a)}}{\sum_{a' \in A(I)} \color{orange}{R^{T,+}_i(I, a')}}, & \text{if } \sum_{a' \in A(I)} \color{orange}{R^{T,+}_i(I, a')} > 0 \\
\frac{1}{\lvert A(I) \rvert}, & \text{otherwise}
\end{cases}$$

where $\color{orange}{R^{T,+}_i(I, a)} = \max \Big(\color{orange}{R^T_i(I, a)}, 0 \Big)$

    def calculate_strategy(self):

        regret = {a: max(r, 0) for a, r in self.regret.items()}

        regret_sum = sum(regret.values())

if $\sum_{a'\in A(I)}\color{orange}{R^{T,+}_i(I, a')} > 0$,

        if regret_sum > 0:

            self.strategy = {a: r / regret_sum for a, r in regret.items()}

Otherwise,

        else:

$\lvert A(I) \rvert$

            count = len(list(a for a in self.regret))

            self.strategy = {a: 1 / count for a, r in regret.items()}

Get average strategy

    def get_average_strategy(self):

        cum_strategy = {a: self.cumulative_strategy.get(a, 0.) for a in self.actions()}

        strategy_sum = sum(cum_strategy.values())

If $\sum_{t=1}^T \pi_i^{\sigma^t}(I) > 0$,

        if strategy_sum > 0:

            return {a: s / strategy_sum for a, s in cum_strategy.items()}

Otherwise,

        else:

$\lvert A(I) \rvert$

            count = len(list(a for a in cum_strategy))

            return {a: 1 / count for a, r in cum_strategy.items()}

Human readable representation

    def __repr__(self):
        raise NotImplementedError()

Counterfactual Regret Minimization (CFR) Algorithm

We do chance sampling (CS) where all the chance events (nodes) are sampled and all other events (nodes) are explored.

Since we are doing chance sampling, $q(z)$ is the same for all terminal histories, so we can ignore it: the constant factor cancels out when calculating the strategy (it appears in both the numerator and the denominator).

class CFR:

$\mathcal{I}$, the set of all information sets.

    info_sets: Dict[str, InfoSet]
  • create_new_history creates a new empty history
  • epochs is the number of iterations to train for, i.e. $T$
  • n_players is the number of players
    def __init__(self, *,
                 create_new_history: Callable[[], History],
                 epochs: int,
                 n_players: int = 2):
        self.n_players = n_players
        self.epochs = epochs
        self.create_new_history = create_new_history

A dictionary for $\mathcal{I}$, the set of all information sets

        self.info_sets = {}

Tracker for analytics

        self.tracker = InfoSetTracker()

Returns the information set $I$ of the current player for a given history $h$

    def _get_info_set(self, h: History):
        info_set_key = h.info_set_key()
        if info_set_key not in self.info_sets:
            self.info_sets[info_set_key] = h.new_info_set()
        return self.info_sets[info_set_key]

Walk Tree

This function walks the game tree.

  • h is the current history $h$
  • i is the player $i$ that we are computing regrets of
  • pi_i is $\pi^{\sigma^t}_i(h)$
  • pi_neg_i is $\pi^{\sigma^t}_{-i}(h)$

It returns the expected utility for the history $h$,

$$\sum_{z \in Z_h} \pi^{\sigma^t}(h, z) \, u_i(z)$$

where $Z_h$ is the set of terminal histories with prefix $h$.

While walking the tree it updates the total regrets $\color{orange}{R^T_i(I, a)}$.

    def walk_tree(self, h: History, i: Player, pi_i: float, pi_neg_i: float) -> float:

If it’s a terminal history $h \in Z$ return the terminal utility $u_i(h)$.

        if h.is_terminal():
            return h.terminal_utility(i)

If it's a chance event $P(h) = c$, sample a chance outcome and continue to the next step.

        elif h.is_chance():
            a = h.sample_chance()
            return self.walk_tree(h + a, i, pi_i, pi_neg_i)

Get current player’s information set for $h$

        I = self._get_info_set(h)

To store $\sum_{z \in Z_h} \pi^\sigma(h, z) u_i(z)$

        v = 0

To store the expected utility $\sum_{z \in Z_{h+a}} \pi^{\sigma^t}(h + a, z) \, u_i(z)$ for each action $a \in A(h)$

        va = {}

Iterate through all actions

        for a in I.actions():

If the current player is $i$,

            if i == h.player():

                va[a] = self.walk_tree(h + a, i, pi_i * I.strategy[a], pi_neg_i)

Otherwise,

            else:

                va[a] = self.walk_tree(h + a, i, pi_i, pi_neg_i * I.strategy[a])

            v = v + I.strategy[a] * va[a]

If the current player is $i$, update the cumulative strategies and total regrets

        if h.player() == i:

Update cumulative strategies

            for a in I.actions():
                I.cumulative_strategy[a] = I.cumulative_strategy[a] + pi_i * I.strategy[a]
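
Update the total regrets $\color{orange}{R^T_i(I, a)}$, adding for each action the regret of not taking it, weighted by the reach probability of the other players: $\pi^{\sigma^t}_{-i}(h) \big( v_a - v \big)$
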

            for a in I.actions():
                I.regret[a] += pi_neg_i * (va[a] - v)

Update the strategy $\color{lightgreen}{\sigma^t(I)(a)}$

            I.calculate_strategy()

Return the expected utility for player $i$, $\sum_{z \in Z_h} \pi^{\sigma^t}(h, z) \, u_i(z)$

        return v

Iteratively update $\color{lightgreen}{\sigma^t(I)(a)}$

This updates the strategies for $T$ iterations.

    def iterate(self):

Loop for epochs iterations

        for t in monit.iterate('Train', self.epochs):

Walk tree and update regrets for each player

            for i in range(self.n_players):
                self.walk_tree(self.create_new_history(), cast(Player, i), 1, 1)

Track data for analytics

            tracker.add_global_step()
            self.tracker(self.info_sets)
            tracker.save()

Save checkpoints every $1,000$ iterations

            if (t + 1) % 1_000 == 0:
                experiment.save_checkpoint()

Print the information sets

        logger.inspect(self.info_sets)

Information set tracker

This is a small helper class to track data from information sets

class InfoSetTracker:

Set tracking indicators

    def __init__(self):
        tracker.set_histogram(f'strategy.*')
        tracker.set_histogram(f'average_strategy.*')
        tracker.set_histogram(f'regret.*')

Track the data from all information sets

    def __call__(self, info_sets: Dict[str, InfoSet]):
        for I in info_sets.values():
            avg_strategy = I.get_average_strategy()
            for a in I.actions():
                tracker.add({
                    f'strategy.{I.key}.{a}': I.strategy[a],
                    f'average_strategy.{I.key}.{a}': avg_strategy[a],
                    f'regret.{I.key}.{a}': I.regret[a],
                })

Configurable CFR module

class CFRConfigs(BaseConfigs):
    create_new_history: Callable[[], History]
    epochs: int = 1_00_000
    cfr: CFR = 'simple_cfr'

Initialize CFR algorithm

@option(CFRConfigs.cfr)
def simple_cfr(c: CFRConfigs):
    return CFR(create_new_history=c.create_new_history,
               epochs=c.epochs)
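
Finally, here is a rough usage sketch of how the CFR class and CFRConfigs might be wired up for a concrete game. The names Configs, KuhnHistory and the experiment name are illustrative placeholders (see the Kuhn poker implementation linked above for the real game logic); treat this as a sketch rather than part of the module.

```python
# A minimal usage sketch, assuming a game-specific `KuhnHistory` subclass of `History`
# (with a matching `InfoSet` subclass) is implemented elsewhere; `KuhnHistory` is a
# hypothetical placeholder, not part of this module.
from labml import experiment
from labml.configs import option


class Configs(CFRConfigs):
    pass


@option(Configs.create_new_history)
def _create_new_history(c: Configs):
    # Return a factory that creates an empty history at the start of each tree walk
    return lambda: KuhnHistory()  # hypothetical game-specific History subclass


def main():
    experiment.create(name='kuhn_poker_cfr')
    conf = Configs()
    experiment.configs(conf)
    with experiment.start():
        # Run CFR for `conf.epochs` iterations
        conf.cfr.iterate()


if __name__ == '__main__':
    main()
```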