Cisco ACI Notes

This is a collection of my Cisco ACI notes during my studies.

Imperative Model vs Declarative Model

  • There are two operational models describing how the hardware acts on the intent of the network administrator: imperative or declarative.
  • The imperative model is what we, network engineers, did for years before the appearance of ACI: we tell network equipment how to implement a feature or a protocol by “programming” it. The result is immediately visible.
  • In the declarative model, by contrast, we tell the hardware where and when we need such and such features, but we do not tell the hardware how to implement them.
  • ACI builds by default a zero-trust network, i.e. communication is not allowed unless specified, which is the opposite of a traditional network, which is trust-based.

APIC Initial Setup

The APIC server is a Cisco UCS-C series. It has generally these types of physical ports:

(Figure: APIC rear view)
  • number 2: IP OOB management network ports
  • number 4: the CIMC management port. This is where you plug an Ethernet cable and access the CIMC web GUI with its IP address to manage the physical server, i.e. the UCS chassis. Note that on the APIC, the CIMC management IP address and the IP OOB management address are two different concepts!
  • number 5: console port. This is where you plug a console cable, or a terminal server to remotely access the chassis console.
  • number 9: ports to connect to the fabric. On the APIC-L3 we have four ports grouped in pairs: the first and second ports form one pair, the third and fourth form another pair. To connect the APIC to the leafs, we may use at most one port from each pair: one from the first pair and one from the second pair.

Initially we must configure CIMC with certain parameters. Once CIMC IP settings are done, we can configure APIC further with the CIMC virtual console (CIMC GUI) or through a direct connection (keyboard and mouse).

  • physically APIC requires 2x 10G connections to the fabric, 2x connections to the Out-of-Band management network and one CIMC connection.
  • The CIMC port is different from the IP Out-of-Band management port.
  • configure parameters such as
    • Controller ID
    • Controller Name
    • Fabric Name
    • Infrastructure VLAN
    • VTEP address pool
    • Out-of-Band Management IP address and gateway
    • Multicast group address
  • once the following parameters are set up during the script install, you can not change them later unless you rebuild the fabric:
    • Infrastructure VLAN
    • VTEP address pool
  • Adding a switch that has a different configuration to the fabric won’t overwrite the fabric configuration, contrary to what most network engineers would think. In fact, there is no such thing as VTP in the ACI fabric.

Shard

A Shard is the smallest unit of data in a database. Understanding this concept is critical to understanding the operation of APIC clusters. In Cisco ACI, each Shard has three instances: one active instance and two backups. The APIC servers in an APIC cluster share these three instances among themselves, in a way that depends on how many APICs we have.

So if we have only one APIC, it holds all three instances of each Shard, which means 100% of the data is lost if that APIC crashes. If we have three APICs, one of them holds the active instance and the other two APICs each hold a backup.
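As a rough illustration of that distribution (this is a made-up placement function, not the actual APIC sharding algorithm), the sketch below spreads three replicas of each shard across however many controllers are available:

# Illustrative sketch only: models how three replicas of each shard could be
# spread across APICs; the real APIC placement algorithm is internal to Cisco.

def place_replicas(num_shards, num_apics, replicas=3):
    """Return {shard_id: [apic_ids holding a replica]}; the first entry is the leader."""
    placement = {}
    for shard in range(num_shards):
        holders = [(shard + offset) % num_apics for offset in range(replicas)]
        # with fewer APICs than replicas, the same APIC shows up more than once
        placement[shard] = holders
    return placement

print(place_replicas(num_shards=4, num_apics=1))  # one APIC holds every replica
print(place_replicas(num_shards=4, num_apics=3))  # replicas spread across 3 APICs

With one APIC every replica lands on controller 0; with three APICs the replicas of each shard are spread across all three, which is why losing a single controller in a three-node cluster does not lose data.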

APIC Cluster

An APIC cluster is composed of at least two APICs and a maximum of 7, as of ACI Release 4.1. Cisco recommends designing APIC clusters in sizes of 3, 5 or 7 APICs in order to preserve the minority/majority in terms of Shards and avoid split-brain APIC scenarios, so always have an odd number of APICs in your cluster. The minimum recommended APIC cluster size is therefore 3. A fourth APIC can be added to the cluster but stays in Cold Standby state. The cold standby APIC:

  • does not participate in policy definition
  • becomes active when an APIC goes down. In this case we need to add a new cold standby APIC to the cluster
  • has its firmware updated whenever the firmware of the active APICs in the cluster is upgraded.

A bad ACI controller design is to have an APIC cluster of 2, 4 or 6 controllers.

In a multi-site ACI design, at any given time one site must contain the majority of the APIC servers. This site is said to have the “quorum”, meaning “the majority of APICs”.

All APICs in a cluster must have the same ACI firmware version. So when you have a running cluster of APICs and you want to add a new APIC, you need to upgrade it first to the same ACI firmware as the active APICs before configuring it to join them.

Before adding an APIC or upgrading an APIC cluster, ensure that each APIC’s health is 100% (“fully fit”).
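One hedged way to check this from a script is to query the cluster membership objects over the APIC REST API. The sketch below assumes the class infraWiNode and its nodeName and health attributes behave as described in the ACI object model; verify them on your release, and replace the hostname and credentials, which are placeholders:

import requests

APIC = "https://apic.example.com"   # placeholder APIC address
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

session = requests.Session()
session.verify = False  # lab only; use proper certificates in production

# 1. authenticate against the APIC REST API (the session keeps the returned cookie)
session.post(f"{APIC}/api/aaaLogin.json", json=AUTH)

# 2. read the cluster membership objects and print each controller's health
resp = session.get(f"{APIC}/api/node/class/infraWiNode.json")
for item in resp.json()["imdata"]:
    attrs = item["infraWiNode"]["attributes"]
    print(attrs["nodeName"], attrs["health"])   # expect "fully-fit" on every APIC

Every controller should report a “fully-fit” health before you add members or start an upgrade.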

If all APICs fail, the fabric continues to forward packets as normal; you simply cannot make configuration changes until a controller is available again.

APIC servers are staged one after the other, unless we are in a multi-pod setting, where not all APICs are in the same physical location. In this case we must set up the Pod environments first, then stage the remote APICs.

APIC clusters of 5 allow for active/standby operation: 3 APICs are active and 2 are standby.

Since Cisco APIC is based on Linux, you can access its Bash shell by issuing the following command:

apic# bash

you will find yourself with this new prompt:

admin@apic:> 

VMware VDS vs Cisco AVS

VMware VDS

  • is purely a L2 virtual switch.
  • spans one or more virtualization hosts, unlike the Virtual Standard Switch VSS (not to be confused with the VSS feature on Catalyst switches)
    • VSS is not supported when integrating ACI with vCenter.
  • can either be managed by vCenter or by ACI when the vSphere environment is integrated with ACI through a VMM Domain. In the latter case, we call it an ACI-managed DVS.
  • does not support OpFlex.
  • supports CDP and LLDP.
  • has closed code that only VMware has access to. That is why APIC follows an imperative model with the DVS.

Cisco AVS

  • Application Virtual Switch: a multi-hypervisor virtual switch (compatible with many hypervisor vendors) that comes with ACI free of charge
  • is built on the successful Nexus 1000V switch.
  • is completely managed by ACI, unlike the 1000V, which was managed by a Virtual Supervisor Module (VSM).
  • supports L2/L3 functionality
  • supports OpFlex
  • is supported by VMware vSphere up to vSphere 6.5.
  • integrates a Distributed Firewall which can be in disabled mode, Learning mode or Enabled mode.
  • supports both VLAN and VXLAN encapsulations
  • supports neither intra-EPG isolation nor intra-EPG contracts
  • its successor is the Cisco ACI Virtual Edge (AVE)

The choice between the VMware VDS and the Cisco AVS is made in the menus during the configuration of the VMM domain integration.

Local Station Table, Global Station Table, Proxy Station Table

  • LST (Local Station Table):
    • exists on each leaf
    • contains entries of the directly attached end points
    • entries are in the format “Endpoint IP address — VTEP address”
  • GST (Global Station Table):
    • found on each leaf
    • builds entries based on traffic learned from non-local endpoints, i.e. endpoints that are attached to other leafs.
    • entries are in the format “Endpoint IP address — VTEP address”
  • PST (Proxy Station Table):
    • found only on spines
    • entries are in the format “Endpoint IP address — VTEP address”.
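The following sketch is a simplified mental model of how a leaf uses these three tables when forwarding; the table contents and the spine-proxy fallback are invented for illustration:

# Illustrative only: models the lookup order a leaf follows when forwarding.
# Table contents and the spine-proxy fallback are simplified for clarity.

local_station_table  = {"10.0.1.10": "local port eth1/1"}   # directly attached endpoints
global_station_table = {"10.0.2.20": "VTEP 10.0.96.64"}      # cached remote endpoints

def leaf_lookup(dst_ip):
    if dst_ip in local_station_table:
        return f"forward out {local_station_table[dst_ip]}"
    if dst_ip in global_station_table:
        return f"encapsulate in VXLAN towards {global_station_table[dst_ip]}"
    # unknown destination: send to the spine proxy, which holds the full proxy station table
    return "send to spine anycast proxy VTEP for lookup"

for ip in ("10.0.1.10", "10.0.2.20", "10.0.3.30"):
    print(ip, "->", leaf_lookup(ip))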

Integrating Workloads with ACI

  • In the IT industry we distinguish physical workloads and virtual workloads.
  • A physical workload is a subset of compute, storage and network resources dedicated to a single physical entity or machine.
  • A virtual workload is the same subset of resources used by a virtual machine.
  • When integrating a physical workload with ACI:
    • we will most likely have to configure policies for each physical NIC or virtual NIC on the server.
    • we configure static path binding on the EPG.
  • To integrate ACI with Microsoft platforms, we have two options:
    • integration with Microsoft SCVMM
    • integration with Azure Pack
      • provides ready-to-use management portal and administrator portal
      • reflects the same experience as Microsoft Azure cloud.

ACI Migration Modes

When companies migrate from a traditional network to ACI, they can adopt one of the following approaches:

  • network-centric mode:
    • the network administrator creates one Bridge Domain and one subnet per VLAN, and puts the servers that were in that VLAN into one EPG. This mode is also known as the VLAN = BD = EPG mode.
    • in this mode we take our existing VLANs and subnets from the old network and create them in ACI; i.e. we “reproduce” the network in ACI.
    • can be a one-tenant or a multi-tenant setup.
  • Application-centric mode: servers are grouped by application or business need rather than by VLAN.
  • Hybrid mode: some servers remain grouped by VLAN, while others are grouped by another criterion, such as application or business need. The hybrid mode is the combination of the network-centric mode and some features from the application-centric mode.

      There is nothing wrong with either migration mode, i.e. you are not forced to migrate to the application-centric mode if you don’t have a need to. Always ask the question “Is my customer happy with my network design?”

      ACI Policies

      Blacklist Model vs Whitelist Model

      In an organisation, the corporate security guidelines and policies follow one of the following security models: blacklist or whitelist. In a blacklist model, everything is open unless specifically denied. In the whitelist model, every communication is denied unless specifically authorized. A quick analogy for the whitelist model is a Cisco IOS access list, which ends with an implicit deny. A hedged example of expressing a whitelist rule in ACI as a contract and filter follows below.
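      The sketch below creates a tenant with a filter allowing TCP/80 and a contract referencing it, through the APIC REST API. The class names (fvTenant, vzFilter, vzEntry, vzBrCP, vzSubj, vzRsSubjFiltAtt) come from the ACI object model, but treat the exact attribute names, the tenant name and the credentials as placeholders to verify against your own APIC:

import requests

APIC = "https://apic.example.com"   # placeholder APIC address
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

s = requests.Session()
s.verify = False   # lab only; use proper certificates in production
s.post(f"{APIC}/api/aaaLogin.json", json=AUTH)

# whitelist example: a filter permitting TCP/80 and a contract that uses it
tenant = {"fvTenant": {"attributes": {"name": "DemoTenant"}, "children": [
    {"vzFilter": {"attributes": {"name": "allow-web"}, "children": [
        {"vzEntry": {"attributes": {"name": "http", "etherT": "ip", "prot": "tcp",
                                    "dFromPort": "80", "dToPort": "80"}}}]}},
    {"vzBrCP": {"attributes": {"name": "web-contract"}, "children": [
        {"vzSubj": {"attributes": {"name": "web-subject"}, "children": [
            {"vzRsSubjFiltAtt": {"attributes": {"tnVzFilterName": "allow-web"}}}]}}]}}]}}

resp = s.post(f"{APIC}/api/mo/uni/tn-DemoTenant.json", json=tenant)
print(resp.status_code)

      Until one EPG provides and another consumes such a contract, no traffic flows between them, which is exactly the whitelist behaviour described above.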

      VLAN Pools

      Overlay vs Underlay

      • VXLAN forms the overlay network in ACI
      • IS-IS builds the underlay network, which runs transparently in the ACI fabric without user intervention.
      • MDT (Multicast Distribution Tree) runs also on the underlay network.

      VXLAN

      • VXLAN in the Cisco ACI fabric is different from the standard VXLAN protocol.
      • VXLAN offers roughly 16 million segments, since the VNI is a 24-bit field. Each segment is distinguished by a VXLAN Network Identifier (VNI).
      • In ACI, VXLAN works in IRB mode.
      • Just as VLANs segment a traditional network, VXLAN segments the network too. We call them simply VXLAN segments.
      • Each VXLAN segment is uniquely identified by a VNI or VNID (VXLAN Network IDentifier). A VXLAN segment is a L2 broadcast segment.
      • At the ingress leaf, the endpoint usually generates traffic encapsulated in a VLAN (some endpoints support VXLAN and NVGRE encapsulation too!). The ACI fabric leaf:
        • encapsulates the ingress frame in a Cisco VXLAN packet that contains both UDP and VXLAN headers. The VXLAN header includes the right VNI (there is a VNI-to-VLAN mapping on each fabric leaf).
        • encapsulates the VXLAN packet into an IP packet
        • routes the packet to the destination VTEP using the IS-IS underlay network.
      • At the egress leaf, the packet is decapsulated from the Cisco VXLAN header and re-encapsulated into a VLAN frame (or whatever the encapsulation at the destination endpoint is). A simplified encapsulation sketch follows this list.

      • There is an implicit “deny any” entry at the end of the ACL: Cisco ACI employs the whitelist model by default.
      • Unknown multicast traffic is multicast traffic crossing the ACI fabric for which no IGMP Join message has been received.
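      The sketch below is a deliberately simplified model of the ingress-leaf steps above. The VLAN-to-VNI table, VTEP addresses and header layout are invented, and the real ACI iVXLAN header carries additional policy fields not shown here:

# Conceptual model of the ingress-leaf encapsulation steps described above.
# VLAN-to-VNI mappings, VTEP addresses and the header layout are simplified.

vlan_to_vni = {10: 0x8F0001, 20: 0x8F0002}   # per-leaf VLAN-to-VNI mapping

def encapsulate(frame_bytes, ingress_vlan, local_vtep, remote_vtep):
    vni = vlan_to_vni[ingress_vlan]
    vxlan_packet = {
        "outer_ip": {"src": local_vtep, "dst": remote_vtep},   # routed over the IS-IS underlay
        "udp": {"dst_port": 4789},                             # standard VXLAN UDP port
        "vxlan": {"vni": vni},
        "payload": frame_bytes,                                # original Ethernet frame
    }
    return vxlan_packet

packet = encapsulate(b"...original frame...", ingress_vlan=10,
                     local_vtep="10.0.96.64", remote_vtep="10.0.96.65")
print(hex(packet["vxlan"]["vni"]))

      The outer IP packet is then routed hop by hop over the IS-IS underlay until it reaches the destination VTEP, where the inverse operation happens.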

      VTEP aka TEP

      VTEP (Virtual Tunnel EndPoint), or simply TEP, refers either to the virtual tunneling endpoint itself or to the tunnel endpoint address.

      The VTEP pool is the address pool that we assign to the TEP devices, our ACI spines and leafs being the VTEP devices. Technically, a VTEP pool is a subnet.

      Each ACI fabric node requires a VTEP address to be able to route packets internally to other fabric nodes.

      We can also call it the VTEP prefix. It is defined during the initial APIC setup and is recommended to be a /16 or a /17 subnet. By default, the VTEP pool is the subnet 10.0.0.0/16. Starting from ACI version 2, we can configure a VTEP pool as small as a /22 subnet.

      Switches in a Pod, whether they are leafs or spines, share the same VTEP prefix. I said “switches” because an APIC does not have a VTEP address.

      We can display the VTEP addresses assigned to the fabric nodes through the APIC; a hedged REST API sketch follows.
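      A minimal sketch, assuming the topSystem class exposes each node's role and infra TEP address in its role and address attributes (verify the attribute names on your release); the hostname and credentials are placeholders:

import requests

APIC = "https://apic.example.com"   # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

s = requests.Session()
s.verify = False                    # lab only
s.post(f"{APIC}/api/aaaLogin.json", json=AUTH)

# topSystem holds per-node system information, including the infra TEP address
nodes = s.get(f"{APIC}/api/node/class/topSystem.json").json()["imdata"]
for node in nodes:
    attrs = node["topSystem"]["attributes"]
    print(attrs["name"], attrs["role"], attrs["address"])   # node name, leaf/spine/controller, TEP address

      Every leaf and spine should show an address taken from the VTEP pool configured at fabric bring-up.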

      VMware NIC cards

      • vNIC: the virtual NIC on a VM
      • vmnic: aka pNIC: the physical NIC on the virtualization host
      • vmknic: a VMkernel NIC, i.e. a virtual adapter on the hypervisor itself, used to transport infrastructure traffic (management, vMotion, IP storage) from/to the hypervisor itself.

      ACI Traffic Classification

      The ACI fabric performs traffic classification when an end host or a NIC is attached to it. The purpose is to correctly assign the endpoint to one preconfigured EPG.

      Traffic classification is based on one of the following criteria, depending on whether we attach a physical workload or a virtual workload, and whether we use an ACI-managed DVS or AVS (a small sketch follows the list below):

      • source MAC address
      • source IP address
      • port and VLAN encapsulation
      • port and VXLAN encapsulation
      • etc.
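      Purely as an illustration of the idea (the ports, encapsulations and EPG names are invented), classification can be pictured as a lookup from a classification key to an EPG:

# Illustration only: classification criteria and EPG names are invented.
# A leaf maps a classification key (e.g. ingress port + encapsulation) to an EPG.

classification_rules = {
    ("eth1/5", "vlan-110"): "EPG_Web",
    ("eth1/5", "vlan-120"): "EPG_App",
    ("eth1/7", "vxlan-8f0001"): "EPG_DB",
}

def classify(port, encapsulation):
    return classification_rules.get((port, encapsulation),
                                    "no EPG: traffic is not forwarded (whitelist model)")

print(classify("eth1/5", "vlan-110"))   # EPG_Web
print(classify("eth1/9", "vlan-999"))   # unmatched traffic is not forwarded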

      OpFlex

      • is a declarative policy distribution model.
      • is supported on ACI components and many physical and virtual switches on the market.
      • is used by APIC to communicate with the Nexus 9000 in the ACI fabric.
      • was presented as a draft to the IETF. As of 2020, it is still not standardized.
      • is a southbound protocol.
      • is not supported on the VMware VDS.
      • is supported on the Microsoft vSwitch, OpenStack OVS and Kubernetes OVS.
      • runs on the ACI Infrastructure VLAN.
      • provides visibility of virtual switches in OpenStack environments
      • When a virtual switch supports OpFlex, we must extend the ACI infrastructure VLAN outside of the fabric, which we can perform in the AAEP configuration menu.
      • finds out which physical ports of the virtualization host are connected to leaf ports, if the virtualization host is directly plugged into the ACI fabric. Otherwise, LLDP (enabled by default) or CDP (not enabled by default) is used. Remember that a virtual switch connects to physical NIC ports of the virtualization host, and the physical NIC ports connect to the ACI fabric.
      • An OpFlex proxy is built into every ACI leaf. It serves to interact with an OVS OpFlex agent whenever we integrate ACI with OpenStack.
      • The OpFlex protocol runs on TCP 8009.

      • NTP must be configured and synchronized on APIC and all fabric nodes. Here is a quick tutorial on setting up NTP on ACI.
      • New nodes being added to the fabric are automatically discovered by APIC through LLDP. As soon as they pop up in the APIC GUI, you can register them or block them from joining the fabric, based on their serial numbers (a registration sketch using the REST API follows this list).
      • New fabric nodes send DHCP requests and receive replies from APIC.
      • APIC sends TEP addresses to the new leafs
      • Giving lower numerical IDs to the spines is recommended. The subsequent higher IDs should be reserved for the leafs.
      • All fabric nodes and APICs should be connected to an OOB network for management purposes. The same OOB network carries traffic to and from a virtual manager like vCenter, when the vCenter is integrated in ACI (see VMM domains).
      • Access to leaf switches through console cable is possible but offers only read capabilities.
      • OS image management occurs on the APIC, which supports TFTP
      • in ACI there is no need to:
        • configure loopback addresses on new switches
        • configure IGP protocol and neighborships
        • configure custom routing timers
        • configure list of allowed VLANs on trunks.
      • Management of the fabric can also be performed using an external management station connected to the fabric on the tenant “mgmt”. In this scenario you must:
        • configure a VLAN Pool, an AEP and a physical domain
        • assign the VLAN Pool to the domain
        • associate the domain with the AEP
      • Provisioning a switch port in traditional networks is completely different from the ACI world:
        • in a traditional switch you configure interfaces separately
        • in ACI, you configure many constructs and objects at first, such as domain, AEP, VLAN Pool, Switch Profile, Interface Profile… which may seem a burden at first. But its power lies in its flexibility and extensibility. For example, if you want to add an interface with a configuration similar to a previous one, simply add it to the Interface Profile.
      • an Application in the ACI model is not a virtual/physical machine, but the combination of:
        • workloads, either physical or virtual
        • L2 – L7 policies: VLANs, subnets, L4 ports, ACL, QoS policies, filtering policies, load balancing policies,…
      • In terms of number of supported Spines, the ACI fabric supports a minimum of 2 and a maximum of 6, in even numbers (2, 4, 6).
      • ACI fabric operates on a whitelist model: no communication is allowed unless specified.
      • Frames in ACI are routed, but the L2 switching semantics are preserved.
      • Infrastructure VLAN
        • is used within the fabric
        • must be unique on the whole network, including end host VLANs
        • must be extended (manually configured) to Blade Systems
        • recommended but not mandatory: use VLAN ID 3967.
      • ACI Basic vs Advanced GUI
        • Basic GUI
          • use cases:
            • for small ACI deployments
            • for network administrators who do not need full ACI features such as L4-7 integration.
          • allows configuration of tenants, leaf ports and access profiles
          • allows configuring only one port at a time
          • is no longer supported above ACI v3.0
        • Advanced GUI
          • allows configuring multiple ports at once through the access selectors and Interface Profiles
          • is the recommended option.
      • VXLAN overlay:
        • the virtual network built using all VTEP addresses of the fabric nodes (leafs and spines).
        • VXLAN packets are routed over the fabric underlay network.
      • Fabric underlay: is the IS-IS network topology that ensures that packets are routed from one fabric node to another. This network runs under the hood and does not need any intervention from the APIC administrator. All routes in the underlay are host routes (/32 subnet masks)
      • Software overlay network:
        • not to be confused with the VXLAN Overlay of the fabric.
        • the logical network built between virtual switches that are located on a hypervisor.
        • When a virtualized server runs two hypervisors, each hypervisor runs its own software network overlay, and the two overlays do not communicate with each other.
        • the software overlay network does not communicate with the physical network either unless a software gateway is installed.
      • Docker containers (the rough equivalent of VMs) in the Linux Docker technology do not have their own TCP/IP stack but rather a namespace in the TCP/IP stack of the host machine.
      • A Blade system (or Blade Chassis) is composed of Blade servers and Blade Switches
        • Blade Switches are physical
        • Blade Servers are physical and contain Virtual Switches
      • ACI plugin for vCenter
        • allows virtualization administrators to interact with APIC in an easy way without the requirement to have prior networking knowledge:
        • virtualization administrator can add/delete/modify ACI constructs (tenants, VRF, Bridge Domains, App Profile, EPG, uSeg EPG), add/modify Port Groups, add/modify VM to Port group associations, etc.
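      As an illustration of node discovery and registration through the REST API, the sketch below lists the discovered nodes and registers one by serial number. The classes fabricNode and fabricNodeIdentP and the target DN uni/controller/nodeidentpol come from the ACI object model, but the serial number, node ID, hostname and credentials are placeholders, and the attribute names should be verified on your release:

import requests

APIC = "https://apic.example.com"   # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

s = requests.Session()
s.verify = False   # lab only
s.post(f"{APIC}/api/aaaLogin.json", json=AUTH)

# list the nodes the APIC has discovered so far (serial numbers, roles, IDs)
for mo in s.get(f"{APIC}/api/node/class/fabricNode.json").json()["imdata"]:
    a = mo["fabricNode"]["attributes"]
    print(a["serial"], a["role"], a["id"], a["name"])

# register a discovered switch by serial number, assigning it a node ID and name
registration = {"fabricNodeIdentP": {"attributes": {
    "serial": "FDO12345678",    # placeholder serial number
    "nodeId": "101",
    "name": "leaf-101"}}}
s.post(f"{APIC}/api/mo/uni/controller/nodeidentpol.json", json=registration)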

      Endpoint Learning

      • there are three so-called station tables (a sketch querying the fabric’s learned endpoints follows this list)
        • local station table:
          • each leaf has a local station table
          • contains all endpoints connected to the local leaf
        • global station table
          • each leaf has a global station table
          • contains cached information about some remote endpoints. Leafs are not supposed to possess forwarding information about all endpoints in the fabric.
        • proxy station table
          • resides on the spines
          • all the spines have the same proxy station table
          • contains forwarding information about all endpoints attached to the leafs (aka the endpoint reachability information).
            • The endpoint reachability information includes:
              • L2 information: VLAN, endpoint MAC address
              • L3 information: endpoint IP address
              • location information: Leaf ID, access port ID.
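      A minimal sketch of reading the fabric’s learned endpoints, assuming the fvCEp class (client endpoint) exposes the MAC, IP and encapsulation attributes named below; verify them against your APIC, and treat the hostname and credentials as placeholders:

import requests

APIC = "https://apic.example.com"   # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

s = requests.Session()
s.verify = False   # lab only
s.post(f"{APIC}/api/aaaLogin.json", json=AUTH)

# fvCEp objects represent endpoints the fabric has learned
endpoints = s.get(f"{APIC}/api/node/class/fvCEp.json").json()["imdata"]
for ep in endpoints:
    a = ep["fvCEp"]["attributes"]
    print(a["mac"], a["ip"], a["encap"], a["dn"])   # MAC, IP, encapsulation, owning EPG DN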

      Microsegmentation

      • leads to the distinction between the original EPG (aka Base EPG) and microsegmented EPG (aka uSeg EPG)
      • The purpose of microsegmenting an EPG is to automate the assignment of selected Virtual Machines to a particular EPG using rules, instead of the VMware administrator having to assign them manually.
      • Each rule is in the format “match-any | match-all {u-attribute}”, where u-attributes are the microsegmentation attributes
      • Only two u-attributes are supported by uSeg EPG when attached to bare metal servers:
        • IP Address
        • MAC Address
      • the list of available u-attributes of an uSeg EPG attaching to a VMM domain is richer:
        • IP Address
        • MAC Address
        • VM Name
        • VM OS
        • VM tag
      • a rule can be a pure “match-any” filter, a pure “match-all” filter, or a combination of both.
      • if there are many clauses in the rule, then beware of the precedence among the u-attributes, e.g. the u-attribute “VM Name” has a higher precedence than “VM tag”. So if the u-attribute “VM Name” matches first, further clauses of the rule won’t be inspected by APIC (see the sketch after this list).
      • available for both physical and VMM Domains
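      The sketch below illustrates the precedence idea with invented precedence weights, rules and VM attributes; the real precedence order and matching engine live inside APIC, so treat this purely as a mental model:

# Invented illustration of u-attribute precedence: lower number = higher precedence.
PRECEDENCE = {"mac": 1, "ip": 2, "vm_name": 3, "vm_tag": 4}

vm = {"mac": "00:50:56:aa:bb:cc", "ip": "10.0.1.50", "vm_name": "web-01", "vm_tag": "prod"}

# rules for a hypothetical uSeg EPG: (attribute, value_predicate)
rules = [
    ("vm_tag", lambda v: v == "prod"),
    ("vm_name", lambda v: v.startswith("web-")),
]

def first_match(vm, rules):
    # evaluate clauses in precedence order; the first matching attribute decides
    for attr, predicate in sorted(rules, key=lambda r: PRECEDENCE[r[0]]):
        if predicate(vm.get(attr, "")):
            return attr
    return None

print(first_match(vm, rules))   # "vm_name" decides before "vm_tag" because of its higher precedence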

      ACI Fabric Multi Site Design For Active-Active Data Centers

      Cisco has determined the following design topologies when dealing with an ACI fabric on multiple sites:

      ACI Stretched Fabric Design

      In a stretched fabric ACI design, the ACI fabric is, as its name suggests, stretched across both sites. We still have:

      • one APIC cluster: one APIC is installed on one site and two APICs on the other site,
      • One control plane, one data plane,

      Regarding the leafs, we have:

      • some leafs from site A physically connect to some spines of site B,
      • transit leafs: which are leafs that connect the sites together,
      • partial or full-meshed physical topology between leafs and spines.

      The Round Trip Time between sites must be less than 10ms.

      The Data Center Interconnect link is one of the following options:

      • for a maximum of 40Gbps throughput, we can choose DWDM or a dark fiber
      • for 100Gbps, there is Ethernet over MPLS (EoMPLS) pseudo-wire technology.

      When the DCI link goes down, we have a split brain situation. In this case, the APIC cluster minority operates in read-only mode.

      A stretched ACI fabric can be single-pod or multi-pod.

      ACI Multi-Pod Design

      • A multi-pod design is considered an evolution of stretched fabric design.
      • Pods can be in the same physical location (intra-DC) or in separate locations (inter-DC) separated by a point-to-point network like dark fiber or DWDM, or by a traditional L3 infrastructure network like an MPLS network.
        • Whether it is MPLS or point-to-point, the transport network must have a maximum of 50ms RTT. This value also depends on the ACI firmware release.
      • Each Pod owns a separate control plane. However, the spines on both Pods exchange COOP entries using MultiProtocol BGP over Ethernet VPN (MP-BGP EVPN).
      • It involves an InterPod Network IPN consisting of IPN devices.
      • IPN devices:
        • can be routers or modular switches, which support MP-BGP.
        • must support Multicast PIM BiDir mode in order to correctly forward BUM traffic between Pods.
        • at least one IPN device per Pod connects to some of the spines; ideally, two IPN devices connect to all spines in each Pod.
        • establish OSPF peering with spines of each pod.
        • have in their routing tables the TEP pool prefixes of the Pods.
        • Each IPN device installs a multicast source-group pair (*, G), with G being the GIPo value of each Bridge Domain of the attached Pod.
        • It is recommended to ensure that a physical path exists at all times between any IPN device A and any IPN device B, whether they are connected to the same Pod or not.
        • Between IPN devices use 10/40/100Gbps connectivity
      • The spines that are peering with the IPN devices perform mutual redistribution:
        • they redistribute IS-IS prefixes (local TEP pool prefix) into OSPF, and
        • redistribute OSPF prefixes they learned from IPN devices (these OSPF prefixes are the remote TEP pool prefixes) into IS-IS, in order to let the local leafs learn them and know how to reach remote TEP addresses.
      • when IP communication fails between the Pods:
        • the Pod with the APIC majority still operates in read/write
        • the Pod with the APIC minority operates in read-only mode. When communication is restored, it synchronizes its database.

      ACI Dual Fabric Design

      In a dual fabric design, each site has its own APIC cluster and its own ACI fabric. The two ACI fabrics are connected over L2 or L3 networks, which terminate on some leafs at each site.

      ACI Multi-Site Design

      The multisite design is an evolution of the dual fabric design. Both ACI fabrics are connected over the WAN. The WAN is connected at the spines of each site.

      ACI Integration With Puppet

      • Puppet is a data center orchestration framework.
      • Puppet configuration includes preparing modules that will be downloaded onto puppet-compatible hardware platforms
      • Puppet components:
        • Puppet Master: the server that hosts the modules
        • Puppet Agents: installed on the Nexus switches
      • The Cisco Nexus 9000 supports Puppet natively in its API, i.e. we can install Puppet modules on the Nexus switch.
      • A Puppet module contains configuration of a certain feature, for example SNMP, VRF, interface speed,… As soon as the module is downloaded on the switch, the changes in the config are visible.

      Configuration Zones

      • A Configuration Zone is a technology that reduces the impact of a change on the tenant infrastructure.
      • Anything from a single leaf up to a complete Pod can be selected as a configuration zone.
      • A configuration change made within a configuration zone is either:
        • enabled: which means “the change takes effect immediately”
        • disabled: which means “the change is queued but not executed”.

      Software Management

      The APIC acts as the software repository for the fabric.

      Do not upgrade all fabric nodes at once! Define groups, upgrade a small number of groups, observe the result, then upgrade the rest.

      At any time, there is:

      • only one image for spines and leafs
      • only one image for the APICs

      Avoid running mixed firmware versions between switches for more than a couple of hours.

      Always upgrade the APICs first then the fabric switches.
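      To spot mixed firmware versions quickly, a hedged sketch querying the firmware objects over the REST API is shown below. The class names firmwareRunning (switches) and firmwareCtrlrRunning (APICs) come from the ACI object model; verify the version attribute on your release, and replace the host and credentials, which are placeholders:

import requests
from collections import Counter

APIC = "https://apic.example.com"   # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

s = requests.Session()
s.verify = False   # lab only
s.post(f"{APIC}/api/aaaLogin.json", json=AUTH)

def versions(class_name):
    # count how many nodes run each image version for the given firmware class
    data = s.get(f"{APIC}/api/node/class/{class_name}.json").json()["imdata"]
    return Counter(mo[class_name]["attributes"]["version"] for mo in data)

print("switch images:", versions("firmwareRunning"))        # leafs and spines
print("APIC images:  ", versions("firmwareCtrlrRunning"))   # controllers

      Ideally each counter contains a single version; more than one entry means the fabric is running mixed firmware.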

      ACI Snapshots And Rollbacks

      Rollbacks are performed from the Admin tab.

      We can configure snapshots and rollbacks on ACI either for all objects or for a selected object.
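      As a hedged sketch of triggering a snapshot through the REST API: the configuration export policy class configExportP with snapshot set to true is, to my understanding, the object behind GUI snapshots, but double-check the DN format and attribute names on your APIC before using this; the policy name, host and credentials are placeholders:

import requests

APIC = "https://apic.example.com"   # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

s = requests.Session()
s.verify = False   # lab only
s.post(f"{APIC}/api/aaaLogin.json", json=AUTH)

# a configuration export policy with snapshot=true stores the export locally as a snapshot;
# setting adminSt to "triggered" runs it once
snapshot_policy = {"configExportP": {"attributes": {
    "name": "nightly-snapshot",      # placeholder policy name
    "format": "json",
    "snapshot": "true",
    "adminSt": "triggered"}}}
s.post(f"{APIC}/api/mo/uni/fabric/configexp-nightly-snapshot.json", json=snapshot_policy)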

      How to check the supervisors on a Nexus 9500

      Having redundant hardware on the ACI fabric spines is critical to maintain the fabric operational. The Cisco Nexus 9500 supports up to two supervisor modules that operate in an active-passive redundancy model.

      To see which supervisor is active and which one is standby:

      Go to Fabric -> Inventory. Select your Pod. Select the spine. Click on Chassis, then Supervisor Modules.

      The list of supervisors, their status and even their serial numbers are displayed on the left of the window.
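      For a scripted check, the hedged sketch below queries supervisor module objects over the REST API. The class name eqptSupC and the rdSt (redundancy state) attribute are assumptions based on the ACI equipment model; confirm them with moquery or Visore on your own fabric before relying on them:

import requests

APIC = "https://apic.example.com"   # placeholder
AUTH = {"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}}

s = requests.Session()
s.verify = False   # lab only
s.post(f"{APIC}/api/aaaLogin.json", json=AUTH)

# eqptSupC objects represent supervisor modules; the attribute names below are assumptions
sups = s.get(f"{APIC}/api/node/class/eqptSupC.json").json()["imdata"]
for sup in sups:
    a = sup["eqptSupC"]["attributes"]
    print(a["dn"], a.get("model"), a.get("ser"), a.get("rdSt"))   # location, model, serial, redundancy state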

      Other Notes

      ACI fat tree = an ACI full-meshed fabric. In other words, all spines are physically connected to all leafs.

      Scaling up a fabric = adding more spines. When we want more bandwidth and redundancy within the fabric we need to scale it up.

      Scaling out a fabric = adding more leafs.

      I’ve distilled all these notes from my ACI study material.
