findcrcs/crcutil-1.0/doc/text/crc.tex

\documentclass{article}

\usepackage{listings}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{url}
\usepackage{authblk}
\usepackage{array} \setlength{\extrarowheight}{1.5pt}


\begin{document}

\lstset{
%language=C++, % choose the language of the code
basicstyle=\footnotesize, % code font size
numbers=left, % where to put the line-numbers
numberstyle=\footnotesize, % line number font size
stepnumber=1, % the step between two line-numbers.
%backgroundcolor=\color{white}, % choose the background color
frame=single,
framerule=1pt,
captionpos=b, % t or b
showstringspaces=false, % underline spaces within strings
showspaces=false, % show spaces within strings with underscores
showtabs=false, % show tabs within strings with underscores
breaklines=true % Break long lines of code
}

\def\CRC{{\rm CRC}_u}
\def\SCRC{{\rm CRC}_0}
\def\BYTE{{\rm BYTE}}
\def\LCD{{\rm LCD}}
\def\CrcWord{{\rm CrcWord}}

\def\remove#1{}

% -----------------------------------------
\title{Everything we know about CRC but afraid to forget}
\author[1]{Andrew Kadatch}
\affil[1]{Google Inc.}
\author[2]{Bob Jenkins}
\affil[2]{Microsoft Corporation}
\maketitle

\begin{abstract}
  This paper describes a novel interleaved, parallelizeable word-by-word
  CRC computation algorithm which computes $N$-bit CRC ($N \leq 64$) on
  modern Intel and AMD processors in 1.2 CPU cycles per byte, improving
  state of the art over word-by-word 32-bit and 64-bit CRCs (2.1 CPU
  cycles/byte) and classic byte-by-byte CRC computation (6-7 CPU cycles/byte).
  It computes 128-bit CRC in 1.7 CPU cycles/byte.

  CRC implementations are heavily optimized and hard to understand. This
  paper describes CRC algorithms as they evolved over time, splitting
  complex optimizations into a sequence of natural improvements.

  This paper also presents a collection of CRC ``tricks" that we found
  handy on many occassions.
\end{abstract}

\tableofcontents


% -----------------------------------------
\section{Definition of CRC}

Cyclic Redundancy Check (CRC) is a well-known technique that allows the
recipient of a message transmitted over a noisy channel to detect whether
the message has been corrupted.

A message $M = m_0 \dots m_{N-1}$ comprised of $N=|M|$ bits ($m_k \in \{0,
1\}$) may be viewed either as a numeric value
  \begin{align*}
    M = \sum_{k=0}^{N-1} m_k 2^{N-1-k}
  \end{align*}
or as a polynomial of a single variable of degree $(N-1)$
  \begin{align*}
    M(x) = \sum_{k=0}^{N-1} m_k x^{N-1-k}
  \end{align*}
where $m_k \in GF(2) = \{0, 1\}$ and all arithmetic operations on
coefficients are performed modulo 2. For example,
  \begin{align*}
    & \mbox{Addition: }
        (x^3+x^2+x+1) + (x^2+x+1) = x^3+2x^2+2x+2 = x^3, \\
    & \mbox{Subtraction: }
        (x^3+x+1) - (x^2+x) = x^3-x^2+1 = x^3+x^2+1, \\
    & \mbox{Multiplication: }
        (x+1)(x+1) = x^2 + 2x + 1 = x^2 + 1.
  \end{align*}

For a given polynomial $P(x)$ of degree $D=\deg\bigl(P(x)\bigr)$,
$\CRC\bigl(M(x),v(x)\bigr)$ is the reminder from division of $\left(M(x)
\cdot x^D\right)$ by $P(x)$. In practice, a more complex formula is used:
  \begin{align}
    \label{e:crcdefinition}
    \CRC\bigl(M(x), v(x)\bigr)
      = \Bigl(\bigl(v(x)-u(x)\bigr) \cdot x^{|M|} + M(x) \cdot x^D + u(x)\Bigr)
        \bmod P(x),
  \end{align}
where polynomial $P(x)$ of degree $D$ and polynomial $u(x)$ of degree less
than $D$ are fixed.

The use of the non-zero value of $u(x)$ guarantees that the CRC of a sequence
of zeroes is different from zero. That allows detection of insertion of
zeroes in the beginning of a message and replacement of both content of the
message and its CRC value with zeroes. Typically,
  \begin{align}
    \label{e:constdefinition}
    u(x) &= \sum_{k=0}^{D-1} x^k.
  \end{align}

The use of auxilary parameter $v(x)$ allows incremental CRC computation
as shown in section \ref{s:incrementalcrc}.

% -----------------------------------------
\section{Related work}

Cyclic Redundancy Checks (CRCs) were proposed by Peterson and Brown
\cite{Peterson61} in 1961. An efficient table-driven software
implementation which reads and processes data byte by byte was described by
Hill \cite{Hill79} in 1979, Perez \cite{Perez83} in 1983. The ``classic"
byte-by-byte CRC algorithm described in section \ref{s:crcbyte} was
published by Sarwate \cite{DBLP:journals/cacm/Sarwate88} in 1988.

In 1993, Black \cite{Black93} published a method that reads data by words
(described in section \ref{s:crcbyteword}); however, it still computes the
CRC byte by byte in strong sequential order.

In 2001, Braun and Waldvogel \cite{\remove{%braun01fast,
}braun01fast-techreport} briefly outlined a specialized variant of a CRC
that could read input data by words and process them byte by byte -- but,
thanks to the use of multiple tables, different bytes from the input word
could be processed in parallel. In 2002, Ji and Killian \cite{JiKillian02}
provided detailed description and analysis of a nearly identical scheme.
Both solutions were targeted for hardware implementation. In 2005, Kouvanis
and Berry \cite{Kounavis2005\remove{, DBLP:conf/iscc/KounavisB05,
DBLP:journals/tc/KounavisB08}} demonstrated clear performance benefits of
this scheme even when it is implemented in software. A generalized version
of this approach is described in section \ref{s:crcword}.

Surprisingly, until \cite{Gopal2010} we have not seen prior art describing
or utilizing a method of computing a CRC by processing in parallel (in an
interleaved manner to utilize multiple ALUs) multiple input streams
belonging to non-overlapping sections of input data, desribed in section
\ref{s:blockword}.

A novel method of CRC computation that processes in parallel multiple words
belonging to overlapping sections of input data is described in section
\ref{s:multiword}. A special case restricted to the use of 64-bit tables,
64-bit reads, and 32 or 64-bit generating polynomials was implemented by
the authors in February-March 2007 and was used by a couple of Microsoft
products. In 2009, the algorithm was generalized and these limitations were
removed.

The fact that the CRC of a message followed by its CRC is a constant value
which does not depend on the message, described in section
\ref{s:storingcrcafter}, is well known and has been widely used in the
telecommunication industry for long time.

A method of storing a carefully chosen sequence of bits after a message so
that the CRC of a message and the sequence of bits appended to the message
produces predefined result, described in \ref{s:storingcrcafter}, was
implemented in 1990 by Zemtsov \cite{Zemtsov90}.

A method for recomputing a known CRC using a new initial CRC value,
described in section \ref{s:changinginitialvalue}, and the method of
computing a CRC of the concatenation of messages having known CRC values
without touching the actual data, described in section
\ref{s:concatenation}, were implemented by one of the authors in 2005 but
were not published.


% -----------------------------------------
\section{CRC tricks and tips}


% -----------------------------------------
\subsection{Incremental CRC computation} \label{s:incrementalcrc}

The use of an arbitrary initial CRC value $v(x)$ allows computation of a CRC
incrementally. If a message
  $M(x) = M_1(x) \cdot x^{|M_2|} + M_2(x)$
is a concatenation of messages $M_1$ and $M_2$, its CRC may be computed
piece by piece because
  \begin{align}
    \label{e:incremental}
    \CRC\bigl(M(x), v(x)\bigr)
      &= \CRC\Bigl(M_2(x), \CRC\bigl(M_1(x), v(x)\bigr)\Bigr).
  \end{align}

Indeed,
  \begin{align*}
    \CRC(M, v)
      &= \bigl((v-u) x^{|M|} + M x^D + u\bigr) \bmod P = \\
      &= \bigl((v-u) x^{|M_1|+|M_2|} + (M_1 x^{|M_2|} + M_2) x^D + u\bigr) \bmod P = \\
      &= \Bigl(\bigl((v-u) x^{|M_1|} + M_1 x^D \bigr) x^{|M_2|} + M_2 x^D + u\Bigr) \bmod P = \\
      &= \bigl(\CRC(M_1, v) x^{|M_2|} + M_2 x^D + u\bigr) \bmod P = \\
      &= \CRC\bigl(M_2, \CRC(M_1, v)\bigr)
  \end{align*}


% -----------------------------------------
\subsection{Changing initial CRC value} \label{s:changinginitialvalue}

If
  $\CRC\bigl(M(x), v(x)\bigr)$
for some initial value $v(x)$ is known, it is possible to compute
  $\CRC\bigl(M(x), v'(x)\bigr)$
for different initial value $v'(x)$ without touching the value of $M(x)$:
  \begin{align}
    \CRC(M, v')
      &= \CRC(M,v) + \Bigl((v'-v) x^{|M|}\Bigr) \bmod P.
  \label{e:fixv}
  \end{align}

Proof:
  \begin{align*}
    \CRC(M, v')
      &= \bigl((v'-u) x^{|M|} + M x^D + u\bigr) \bmod P = \\
      &= \Bigl(\bigl((v'-u)+(v-v)\bigr) x^{|M|} + M x^D + u\Bigr) \bmod P = \\
      &= \Bigl(\bigl((v-u)+(v'-v)\bigr) x^{|M|} + M x^D + u\Bigr) \bmod P = \\
      &= \Bigl(\bigl((v-u) x^{|M|} + M x^D + u\bigr) + (v'-v) x^{|M|}\Bigr) \bmod P = \\
      &= \CRC(M,v) + \Bigl((v'-v) x^{|M|} \bmod P \Bigr).
  \end{align*}


% -----------------------------------------
\subsection{Concatenation of CRCs} \label{s:concatenation}

If a message
  $M(x) = M_1(x) \cdot x^{|M_2|} + M_2(x)$
is a concatenation of messages $M_1$ and $M_2$, and CRCs of $M_1$, $M_2$
(computed with some initial values $v_1(x)$, $v_2(x)$ respectively) are
known,
 $\CRC\bigl(M(x),v(x)\bigr)$
may be computed without touching contents of the message $M$:
\begin{enumerate}
\item
  Using formula (\ref{e:fixv}), the value of $v'_1 = \CRC(M_1,v)$ may
  be computed from the known $\CRC(M_1,v_1)$ without touching the contents of $M_1$.
\item
  Then, $v'_2 = \CRC(M_2, v'_1)$ may be computed from known $\CRC(M_2,v_2)$
  without touching the contents of $M_2$.
\end{enumerate}
According to (\ref{e:incremental}), $\CRC(M,v) = v'_2$.


% -----------------------------------------
\subsection{In-place modification of CRC-ed message} \label{s:replacement}

Sometimes it is necessary to replace a part of message $M(x)$ in-place and
recompute CRC of modified message $M(x)$ efficiently.

If a message $M=ABC$ is a concatenation of messages $A$, $B$, and $C$, and
$B'(x)$ is new message of the same length as $B(x)$, $\CRC(M')$ of message
$M'=AB'C$ may be computed from known $\CRC(M)$. Indeed,
  \begin{align*}
    M(x)  &= A(x) \cdot x^{|B| + |C|} + B(x) \cdot x^{|C|} + C(x), \\
    M'(x) &= A(x) \cdot x^{|B| + |C|} + B'(x) \cdot x^{|C|} + C(x) = \\
          &= M(x) + \bigl(B'(x) - B(x)\bigr) \cdot x^{|C|},
  \end{align*}
therefore
  \begin{align*}
    &  \CRC\bigl(M'(x),v(x)\bigr) = \\
    &= \CRC\Bigl(M(x) +\bigl(B'(x) - B(x)\bigr) \cdot x^{|C|}\Bigr) = \\
    &= \Bigl(\bigl(v(x)-u(x)\bigr) x^{|M|} + M(x) x^D + \bigl(B'(x) - B(x)\bigr) x^{|C| + D} + u(x)\Bigr) \bmod P(x) \\
    &= \Bigl(\CRC\bigl(M(x),v(x)\bigr) + \bigl(B'(x) - B(x)\bigr) x^{|C| + D}\Bigr) \bmod P(x) = \\
    &= \CRC\bigl(M(x),v(x)\bigr) + \Bigl(\bigl(B'(x) - B(x)\bigr) x^{|C| + D} \bmod P(x)\Bigr).
  \end{align*}

It is easy to see that
  \begin{align*}
    &  \CRC\bigl(B'(x),v(x)\bigr) - \CRC\bigl(B(x),v(x)\bigr) = \\
    &= \bigl(B'(x) - B(x)\bigr) x^{D} \bmod P(x),
  \end{align*}
so
  \begin{align*}
    & \CRC\bigl(M'(x),v(x)\bigr) = \CRC\bigl(M(x),v(x)\bigr) + \Delta \\
  \end{align*}
where
  \begin{align*}
    & \Delta = \Bigl(\CRC\bigl(B'(x),v(x)\bigr) - \CRC\bigl(B(x),v(x)\bigr) \Bigr) x^{|C|} \bmod P(x).
  \end{align*}

% -----------------------------------------
\subsection{Storing CRC value after the message} \label{s:storingcrcafter}

Often $Q(x) = \CRC\bigl(M(x),v(x)\bigr)$ is padded with zero bits until the
nearest byte or word boundary and is transmitted as a sequence of $W$ bits
($W \geq D$) right after the message $M(x)$. This way, the transmitted
message $T(x)$ is the concatenation of $M(x)$ and $Q(x)$ followed by
$(W-D)$ zeroes, and is equal to
  \begin{align*}
    T(x) = M(x) \cdot x^W + Q(x) \cdot x^{W-D}.
  \end{align*}

According to (\ref{e:crcdefinition}), (\ref{e:incremental}) and taking into
account that $Q(x)+Q(x) = 0$ since polynomial coefficient are from $GF(2)$,
$\CRC\bigl(T(x), v(x)\bigr)$ is a constant value which does not depend on
the contents of the message and is equal to
  \begin{align*}
    & \CRC\bigl(T(x), v(x)\bigr) = \\
    & = \CRC\Bigl(Q(x) \cdot x^{W-D}, CRC\bigl(M(x), v(x)\bigr)\Bigr) = \\
    & = \CRC\bigl(Q(x) \cdot x^{W-D}, Q(x)\bigr) = \\
    & = \Bigl(\bigl(Q(x)-u(x)\bigr) \cdot x^W + Q(x)\cdot x^{W-D} \cdot x^D + u(x)\Bigr)
        \bmod P(x) = \\
    & = \Bigl(u(x)\left(1 - x^W\right)\Bigr) \bmod P(x).
  \end{align*}

A more generic solution is to store a $W$-bit long value after the message
such that the CRC of the transmitted message is equal to a predefined value
$R(x)$ (typically $R(x)=0$). The $D$-bit value followed by $(W-D)$ zero
bits that should be stored after $M(x)$ is
  \begin{align*}
    \hat{q}\bigl(Q(x)\bigr) = \Bigl(\bigl(R(x) - u(x)\bigr) x^{-W} - \bigl(Q(x) - u(x)\bigr)\Bigr) \bmod P(x)
  \end{align*}
where $x^{-W}$ is the multiplicative inverse of $x^W \bmod P(x)$ which
exists if $P(x)$ is not divisble by $x$ and may be found by the extended
Euclidean algorithm \cite{Hasan01}:
  \begin{align*}
    & \CRC\Bigl(\hat{q}\bigl(Q(x)\bigr)x^{W-D}, CRC\bigl(M(x), v(x)\bigr)\Bigr) = \\
    & = \CRC\Bigl(\hat{q}\bigl(Q(x)\bigr)x^{W-D}, Q(x)\Bigr) = \\
    & = \Bigl(\bigl(Q(x)-u(x)\bigr) \cdot x^W + \hat{q}\bigl(Q(x)\bigr) \cdot x^{W-D} \cdot x^D + u(x)\Bigr)
          \bmod P(x) = \\
    & = R(x).
  \end{align*}


% -----------------------------------------
\section{Efficient software implementation}

% -----------------------------------------
\subsection{Mapping bitstreams to hardware registers}

For little-endian machines (assumed from now on), the result of loading of
a $D$-bit word from memory into hardware register matches the expectations:
the 0-th bit of the 0-th byte becomes the 0-th (least significant) bit of
the word corresponding to $x^{(D-1)}$.

For example, the 32-bit sequence of 4 bytes 0x01, 0x02, 0x03, 0x04
(0x04030201 when loaded into a 32-bit hardware register) corresponds to the
polynomial
  \begin{align*}
    \left(x^{31} + x^{22} + x^{15} + x^{14} + x^{5}\right).
  \end{align*}

Addition and subtraction of polymonials with coefficients from $GF(2)$ is
the bitwise XOR of their coefficients. Multiplication of a polynomial by
$x$ is achieved by logical right shift of register contents by 1 bit. If a
shift operation causes a carryover, the resulting polynomial has degree
$D$.

Polynomials of degree less than $D$ whose coefficients are recorded using
exactly $D$ bits irrespective of actual degree of the polynomial will be
called {\it $D$-normalized}.

Whenever possible -- and unless mentioned explicitly -- all polynomials
will be represented in $D$-normalized form.

Since the generating polynomial $P(x)$ is of degree $D$ and has $(D+1)$
coefficients, it does not fit into the $D$-bit register. However, its most
significant coefficient is guaranteed to be 1 and may be implied
implicitly.

% -----------------------------------------
\subsection{Multiplication of $D$-normalized polynomials} \label{s:shiftandadd}

Multiplication of two $D$-normalized polynomials may be accomplished by
traditional bit-by-bit, shift-and-add multiplication. This is adequate if
performance is not a concern. Sample code is given in listing
\ref{l:MulNormalizedPoly}.

\begin{figure}
\begin{lstlisting}[caption={Multiplication of normalized polynomials},label={l:MulNormalizedPoly}]
// "a" and "b" occupy D least significant bits.
Crc Multiply(Crc a, Crc b) {
  Crc product = 0;
  Crc bPowX[D];  // bPowX[k] = (b * x**k) mod P
  bPowX[0] = b;
  for (int k = 0; k < D; ++k) {
    // If "a" has non-zero coefficient at x**k,
    // add ((b * x**k) mod P) to the result.
    if (((a & (1 << (D-k)) != 0) product ^= bPowX[k];

    // Compute bPowX[k+1] = (b ** x**(k+1)) mod P.
    if (bPowX[k] & 1) {
      // If degree of (bPowX[k] * x) is D, then
      // degree of (bPowX[k] * x - P) is less than D.
      bPowX[k+1] = (bPowX[k] >> 1) ^ P;
    } else {
      bPowX[k+1] = bPowX[k] >> 1;
    }
  }
  return product;
}
\end{lstlisting}
\end{figure}

% -----------------------------------------
\subsection{Multiplication of unnormalized polynomial}

During initialization of CRC tables it may be necessary to multiply
$d$-normalized polynomial $v(x)$ of a degree $d \neq D$ by a $D$-normalized
polynomial. It may be accomplished by representing the operand as a sum of
weighted polynomials of degree of no more than $(D-1)$, then calling
$Multiply()$ function repeatedly as shown in listing
\ref{l:MulUnnormalizedPoly}.

\begin{figure}
\begin{lstlisting}[caption={Multiplication of unnormalized polynomial},label={l:MulUnnormalizedPoly}]
// "v" occupies "d" least signficant bits.
// "m" occupies D least significant bits.
Crc MultiplyUnnormalized(Crc v, int d, Crc m) {
  Crc result = 0;
  while (d > D) {
    Crc temp = v & ((1 << D) - 1);
    v >>= D;
    d -= D;
    // XpowN returns (x**N mod P(x)).
    result ^= Multiply(temp, Multiply(m, XpowN(d)));
  }
  result ^= Multiply(v << (D - d), m);
  return result;
}
\end{lstlisting}
\end{figure}


% -----------------------------------------
\subsection{Computing powers of $x$} \label{s:mulpown}

Often (see sections \ref{s:changinginitialvalue}, \ref{s:concatenation},
\ref{s:storingcrcafter}) it is necessary to compute $x^N \bmod P(x)$ for
very large values of $N$. This may be accomplished in
$O\bigl(\log(N)\bigr)$ time.

Consider the binary representation of $N$:
  \begin{align*}
    N = \sum_{k=0}^K n_k 2^k
  \end{align*}
where $n_k \in \{0, 1\}$. Then
  \begin{align}
    x^N &= x^{\sum n_k 2^k}
         = \prod_{k=0}^K x^{n_k 2^k}
         = \prod_{n_k != 0} x^{2^k} \label{e:pow2k}
  \end{align}
and may be computed using no more than
    $\left(\left\lfloor \log_2(N) \right\rfloor + 1\right)$
multiplications of polynomials of degree less than $D$ provided known
values of
  \begin{align}
    Pow2k(k) = x^{2^k} \bmod P(x).
  \end{align}

Values of $Pow2k(k)$ may be computed iteratively using one multiplication
$\bmod P(x)$ per iteration:
  \begin{align*}
    Pow2k(0) &= 0, \\
    Pow2k(k + 1)
      &= x^{2^{k+1}} \bmod P(x) = \\
      &= x^{2 \cdot 2^k} \bmod P(x) = \\
      &= \left(x^{2^k}\right)^2 \bmod P(x) = \\
      &= \Bigl(Pow2k(k-1)\Bigr)^2 \bmod P(x).
  \end{align*}

% -----------------------------------------
\subsection{Simplified CRC}

It is sufficient to be able to compute
  \begin{align}
    \SCRC\bigl(M(x), v(x)\bigr)
      &= \Bigl(v(x) \cdot x^{|M|} + M(x) \cdot x^D\Bigr)
        \bmod P(x), \label{e:simplifiedcrc}
  \end{align}
since
  \begin{align*}
    \CRC\bigl(M(x),v(x)\bigr) &= \SCRC\bigl(M(x), v(x) - u(x)\bigr) + u(x),
  \end{align*}
$\CRC\bigl(M(x),v(x)\bigr)$ of message $M = M_1 \ldots M_K$ may be computed
incrementally using $\SCRC$ instead of $\CRC$:
  \begin{align*}
    v_0(x) &= v(x) - u(x), \\
    v_k(x) &= \SCRC\bigl(M_k(x),v_{k-1}(x)\bigr), \\
    \CRC(M(x), v(x)) &= v_K + u(x).
  \end{align*}


% -----------------------------------------
\subsection{Computing a CRC byte by byte} \label{s:crcbyte}

If $M(x)$ is $W$-bit value (typically, $W=8$) and
$\deg\bigl(v(x)\bigr) < D$, by definition (\ref{e:simplifiedcrc})
  \begin{align*}
    \SCRC\bigl(M(x), v(x)\bigr)
      = \Bigl(v(x) \cdot x^W + M(x) \cdot x^D\Bigr) \bmod P(x).
  \end{align*}

When $D \leq W$,
\begin{align}
  \SCRC\bigl(M(x), v(x)\bigr)
    &= \Bigl(v(x) \cdot x^W + M(x) \cdot x^D\Bigr) \bmod P(x) = \nonumber \\
    &= \Bigl(\bigl(v(x) \cdot x^{W-D} + M(x)\bigr) \cdot x^D\Bigr) \bmod P(x), \label{e:crcbytetable2}
\end{align}
which may be obtained via single lookup into precomputed table $T$ of size
$2^W$ such that $T[i] = \bigl(i(x) \cdot x^D)\bigr) \bmod P(x)$ since
$\deg\bigl(v(x) \cdot x^{W-D} + M(x)\bigr) < W$.

$D$-normalized representation of $v(x)$ occupies $D$ least significant bits
and is equal to $\left(v(x) \cdot x^{W-D}\right)$ when viewed as
$W$-normalized representation which is required to form $W$-bit index into
a table of $2^W$ entries. Therefore, explicit multiplication of $v(x)$ by
$x^{W-D}$ in formula (\ref{e:crcbytetable2}) is not required.

When $D \geq W$, $v(x)$ may be represented as
  \begin{align*}
    v(x) = v_L(x) + v_H(x) \cdot x^{D-W}
  \end{align*}
where
  \begin{align*}
    v_H(x) &= \left\lfloor\frac{v(x)}{x^{D-W}}\right\rfloor,
        &\deg\bigl(v_H(x)\bigr) &< W, \\
    v_L(x) &= v(x) \bmod x^{D-W},
        &\deg\bigl(v_L(x)\bigr) &< D-W.
  \end{align*}

Since $\deg\bigl(v_L(x) \cdot x^W \bigr) < D$, $\Bigl(v_L(x) \cdot
x^W\Bigr) \bmod P(x) = v_L(x) \cdot x^W$. Therefore,
  \begin{align}
    & \SCRC\bigl(M(x), v(x)\bigr) = \nonumber \\
      &= \Bigl(v(x) \cdot x^W + M(x) \cdot x^D\Bigr) \bmod P(x) = \nonumber \\
      &= \Bigl(\bigl(v_L(x) + v_H(x) \cdot x^{D-W}\bigr)\cdot x^W + M(x) \cdot x^D\Bigr) \bmod P(x) = \nonumber \\
      &= \Bigl(v_L(x) \cdot x^W + \bigl(v_H(x) + M(x)\bigr) \cdot x^D\Bigr) \bmod P(x) = \nonumber \\
      &= \Bigl(v_L(x) \cdot x^W + \bigl(v_H(x) + M(x)\bigr) \cdot x^D\Bigr) \bmod P(x) = \nonumber \\
      &= \Bigl(v_L(x) \cdot x^W\Bigr) \bmod P(x) + \Bigl(\bigl(v_H(x) + M(x)\bigr) \cdot x^D \Bigr) \bmod P(x) = \nonumber \\
      &= \Bigl(v_L(x) \cdot x^W\Bigr) + \mbox{MulByXpowD}\bigl(v_H(x) + M(x)\bigr), \label{e:crcbyte}
  \end{align}
where
  \begin{align}
    \mbox{MulByXpowD}\bigl(a(x)\bigr) = \bigl(a(x) \cdot x^D \bigr) \bmod P(x). \label{e:crcbytetable}
  \end{align}

The value of $\bigl(v_L(x) \cdot x^W\bigr)$ may be computed by shifting
$v(x)$ by $W$ bits and discarding $W$ carry-over zero bits.

Since $\deg\bigl(v_H(x) + M(x)\bigr) < W$, the value of
$\mbox{MulByXpowD}\bigl(v_H(x) + M(x)\bigr)$ may be obtained using
precomputed table containing $2^W$ entries.

The classic table-driven, byte-by-byte CRC computation \cite{Perez83,
DBLP:journals/cacm/Sarwate88} implementing formulas
(\ref{e:crcdefinition}), (\ref{e:incremental}), (\ref{e:crcbytetable2}),
(\ref{e:crcbyte}), and (\ref{e:crcbytetable}) for $W=8$ is given in listing
\ref{l:CrcByte}.

\begin{figure}
\begin{lstlisting}[caption={Computing CRC byte by byte},label={l:CrcByte}]
Crc CrcByte(Byte value) {
  return MulByXpowD[value];
}
Crc CrcByteByByte(Byte *data, int n, Crc v, Crc u) {
  Crc crc = v ^ u;
  for (int i = 0; i < n; ++i) {
    Crc ByteCrc = CrcByte(crc ^ data[i]);
    crc >>= 8;
    crc ^= ByteCrc;
  }
  return (crc ^ u);
}
void InitByteTable() {
  for (int i = 0; i < 256; ++i) {
    MulByXPowD[i] = MultiplyUnnormalized(i, 8, XpowN(D));
  }
}
\end{lstlisting}
\end{figure}

Experience shows that computing CRC byte by byte is rather slow and,
depending on a compiler and input data size, takes $6-8$ CPU cycles per
byte on modern 64-bit CPU for $D <= 64$. There are two reasons for it:

\begin{enumerate}
\item
  Reading data 8 bits at a time is not the most efficient data access
  method on 64-bit CPU.
\item
  Modern CPUs have multiple ALUs and may execute 3-4 instructions per CPU
  cycles provided the instructions handle independent data flows. However,
  byte-by-byte CRC contains only one data flow. Futhermore, most
  instructions use the result from the previous instruction, leading to CPU
  stalls because of result propagation delays.
\end{enumerate}


% -----------------------------------------
\subsection{Rolling CRC} \label{s:rollingcrc}

Given a set of messages $M_k=m_{k} \ldots m_{k+N-1}$ where $m_k$ are
$W$-bit symbols and $N$ is fixed (i.e. each next message is obtained by
removing first symbol and appending new one), $C_{k+1} = \CRC(M_{k+1}, v)$
may be obtained from known $C_k = \CRC(M_k, v)$ and symbols $m_k$ and
$m_{k+N}$ only, without the need to compute CRC of entire message
$M_{k+1}$. This property may be utilized to efficiently compute a set of
rolling Rabin fingerpints.

Since $M_{k+1}(x) = M_k(x) x^W - m_{k}(x) x^{NW} + m_{k+N}(x)$,
  \begin{align*}
    & C_{k+1}(x) = \CRC\bigl(M_{k+1}(x), v(x)\bigr) = \\
    &= \left(\bigl(v(x)-u(x)\bigr) x^{NW} + u(x) + \sum_{n=0}^{N-1} m_{k+1+n}(x) x^{D+W(N-1-n)} \right) \bmod P(x) = \\
    &= F\bigl(C_k(x), m_{k+N}(x)\bigr) + G\bigl(m_k(x)\bigr),
  \end{align*}
where
  \begin{align*}
    & F\bigl(C_k(x), m_{k+N}(x)\bigr) = \Bigl(C_k(x) x^W + m_{k+N}(x) x^D\Bigr) \bmod P, \\
    & G\bigl(m_k(x)\bigr) = \Bigl(\bigl(\bigl(v(x)-u(x)\bigr) x^{NW} + u\bigr) (1 - x^W) - m_k(x) x^{D+NW} \Bigr) \bmod P
  \end{align*}
are polynomials of degree less than $D$.

$G\bigl(m_{k-1}(x)\bigr)$ may be computed easily via a single lookup
in a table of $2^W$ entries indexed by $m_k$.

Computation of $F\bigl(C_k(x), m_{k+N}(x)\bigr)$ may be implemented as
described in section \ref{s:crcbyte} and requires one bitwise shift, one
bitwise XOR, and one lookup into a precomputed table containing $2^W$
entries.


% -----------------------------------------
\subsection{Reading multiple bytes at a time} \label{s:crcbyteword}

One straightforward way to speed up byte-by-byte CRC computation is to read
$W > 8$ bits at once. Unfortunately, this is the path of very rapidly
diminishing return as the size of the MulByPowD table increases with $W$
exponentially. From practical perspective, it is extremely desirable to
ensure that the MulByPowD table fits into the L1 cache (32-64KB), otherwise
table entry access latency sharply increases from 3-4 CPU cycles (L1 cache)
to 15-20 CPU (L2 cache).

The value of $\mbox{MulByXpowD}\bigl(v(x)\bigr)$ may be computed
iteratively using a smaller table because
  \begin{align}
    \mbox{MulByXpowD}\bigl(v(x)\bigr)
      = v(x) \cdot x^D \bmod P(x)
      = \SCRC\bigl(v(x), 0\bigr) \label{e:readwordatonce}
  \end{align}
and therefore may be computed using formulas (\ref{e:incremental}) and
(\ref{e:crcbyte}) for smaller values of $W'$.

\cite{Black93} provided the implementation for $W=32$ and $W'=8$. Our more
general implementation was faster than byte-by-byte CRC but not
substentially: the improvement was in 20-25\% range. However, the result is
still important -- it demonstrates that reading input data per se is not a
bottleneck.


% -----------------------------------------
\subsection{Computing a CRC word by word} \label{s:crcword}

The value of $\mbox{MulByXpowD}\bigl(v(x)\bigr)$ may be computed using
multiple smaller tables instead of one table. Given that
$\deg\bigl(v(x)\bigr) < W$, $v(x)$ may be represented as a weighted
sum of polynomials $v_k(x)$ such that $\deg\bigl(v_k(x)\bigr) < B$:
  \begin{align*}
    v(x) = \sum_{k=0}^{K-1} v_k(x) \cdot x^{(K-1-k)B},
  \end{align*}
where $K = \lceil W/B \rceil$ and
  \begin{align*}
    v_k(x) = \left\lfloor \frac{v(x)}{x^{(K-1-k)B}} \right\rfloor \bmod x^B.
  \end{align*}

Consequently,
  \begin{align}
    \mbox{MulByXpowD}\bigl(v(x)\bigr)
      &= v(x) \cdot x^D \bmod P(x) = \nonumber \\
      &= \left(\sum_{k=0}^{K-1} v_k(x) \cdot x^{(K-1-k)B}\right) \cdot x^D \bmod P(x) = \nonumber \\
      &= \sum_{k=0}^{K-1} \left(v_k(x) \cdot x^{(K-1-k)B+D} \bmod P(x)\right) = \nonumber \\
      &= \sum_{k=0}^{K-1} \mbox{MulWordByXpowD}\bigl(k, v_k(x)\bigr), \label{e:crcword}
  \end{align}
where the values of
  \begin{align}
    \mbox{MulWordByXpowD}\bigl(k, v_k(x))\bigr) = v_k(x) \cdot x^{(K-1-k)B+D} \bmod P(x) \label{e:crcwordtable}
  \end{align}
may be obtained using $K$ precomputed tables. Given that
$\deg\bigl(v_k(x)\bigr) < B$, each table should contain $2^B$
entries.

A sample implementation of formulas (\ref{e:crcdefinition}),
(\ref{e:incremental}), (\ref{e:crcword}), and (\ref{e:crcwordtable}) is
given in listing \ref{l:CrcWord} using $B=8$ and assuming that $W$ is a
multiple of 8.

\begin{figure}
\begin{lstlisting}[caption={Computing CRC word by word},label={l:CrcWord}]
Crc CrcWord(Word value) {
  Crc result = 0;
  // Unroll this loop or let compiler do it.
  for (int byte = 0; byte < sizeof(Word) / 8; ++byte) {
    result ^= MulWordByXpowD[byte][(Byte) value];
    value >>= 8;
  }
  return result;
}
Crc CrcWordByWord(Word *data, int n, Crc v, Crc u)
  Crc crc = v ^ u;
  for (int i = 0; i < n; ++i) {
    Crc WordCrc = CrcWord(crc ^ data[i]);
    if (sizeof(Crc) <= sizeof(Word)) {
      crc = WordCrc;
    } else {
      crc >>= 8;
      crc ^= WordCrc;
    }
  }
  return (crc ^ u);
}
void InitWordTables() {
  for (int byte = 0; byte < sizeof(Word) / 8; ++byte) {
    // (K-1-k)*B + D = (W/8-1-byte)*8 + D = D - 8 + W - 8*byte.
    Crc m = XpowN(D - 8 + sizeof(Word)*8 - 8*byte);
    for (int i = 0; i < 256; ++i) {
      MulWordByXpowD[byte][i] =MultiplyUnnormalized(i, 8, m);
    }
  }
}
\end{lstlisting}
\end{figure}

CrcWordByWord\footnote{The variant presented in this paper is more general
than ``slicing" described in \cite{Kounavis2005\remove{,
DBLP:conf/iscc/KounavisB05, DBLP:journals/tc/KounavisB08}}. Sample
implementation given in listing \ref{l:CrcWord} does not include one subtle
optimization implemented in \cite{Kounavis2005\remove{,
DBLP:conf/iscc/KounavisB05, DBLP:journals/tc/KounavisB08}} as it was found
to be counter-productive.} with $W=64$ uses only 2.1-2.2 CPU cycles/byte on
modern 64-bit CPUs (our implementation is somewhat faster than the one
described in \cite{Kounavis2005\remove{, DBLP:conf/iscc/KounavisB05,
DBLP:journals/tc/KounavisB08}}). It solves the problem with data access
and, to lesser degree, allows instruction level parallelism: in the middle
of the unrolled main loop of CrcOfWord function the CPU may process
multiple bytes in parallel.

However, this solution is still imperfect -- the beginning of computation
contends for a single source of data (variable $value$), and the end of
computation contends for a single destination (variable $result$). Further
improvement requires processing of multiple independent data streams in
interleaved manner so that when computation of one data flow path is
stalled the CPU may proceed with another one.

% -----------------------------------------
\subsection{Processing non-overlapping blocks in parallel} \label{s:blockword}

Straighforward pipepiling may be achieved by spliting the input message
$M(x)=M_0(x) \ldots M_{N-1}(x)$ into $N$ blocks $M_k(x)$ of approximately
the same size and computing CRC of each block in an interleaved manner,
concatenating CRCs of individual blocks in the end. A sample implementation
is given in listing \ref{l:CrcWordBlock}.

\begin{figure}
\begin{lstlisting}[caption={Processing non-overlapping blocks in parallel},label={l:CrcWordBlock}]
// Processes N stripes of StripeWidth words each
// word by word, in an interleaved manner.
Crc CrcWordByWordBlocks(Word *data, Crc v, Crc u) {
  assert(n % (N * StripeWidth) == 0);
  // Use N local variables instead of the array.
  Crc crc[N];
  // Initialize the CRC value for each stripe.
  crc[0] = v ^ u;
  for (int stripe = 1; stripe < N; ++stripe)
    crc[i] = 0 ^ u;
  // Compute each stripe's CRC.
  for (int i = 0; i < StripeWidth; ++i) {
    // Compute multiple CRCs in interleaved manner.
    Word buf[N];
    for (int stripe = 0; stripe < N; ++stripe) {
      buf[i] =
        crc[stripe] ^ data[i + stripe * StripeWidth];
      if (D > sizeof(Word) * 8) {
        crc[stripe] >>= D - sizeof(Word) * 8;
      } else {
        crc[stripe] = 0;
      }
    }
    for (int byte = 0; byte < sizeof(Word) / 8; ++byte) {
      for (int stripe = 0; stripe < N; ++stripe) {
        crc[stripe] ^=
          MulWordByXpowD[byte][(Byte) buf[stripe]];
        buf[stripe] >>= 8;
      }
    }
  }
  // Combine stripe CRCs.
  for (int stripe = 1; stripe < N; ++stripe) {
    crc[0] = ChangeStartingValue(
                crc[stripe], StripeWidth, 0, crc[0]);
  }
  return (crc[0] ^ u);
}
\end{lstlisting}
\end{figure}

A tuned implementation of $CrcWordByWordBlocks$ is capable of processing
data at 1.3-1.4 CPU cycles/byte on sufficiently large (64KB and more)
inputs, which is noticeably better that 2.1-2.2 CPU cycles/byte delivered
by word by word CRC computation. It is a good sign that it is a move in
right direction.

The drawbacks of this approach are obvious: it does not work well with
small inputs -- the cost of CRC concatentation becomes a bottleneck, -- and
it may be susceptible to false cache collisions caused by cache line
aliasing.

If the cost of CRC concatenation was not a problem, cache pressure could be
mitigated with the use of very narrow stripes. The code in question, lines
33-37 of listing \ref{l:CrcWordBlock} which combine CRCs of individual
stripes, iteratively computes
  \begin{align*}
    \mbox{crc}_0(x) = \mbox{crc}_k(x) + \Bigl(\mbox{crc}_0 \cdot x^{8S} \bmod P(x)\Bigr)
  \end{align*}
for $k = 1, \ldots, N-1$ where $N$ and $S$ are the number and the width of
the stripes respectively. It may be rearranged as
  \begin{align*}
    \mbox{crc}_0(x) = \sum_{k = 0}^{N-1} \Bigl(\mbox{crc}_{K-1-k} \cdot x^{8kS} \bmod P(x)\Bigr).
  \end{align*}

Explicit multiplication by $x^{8kS}$ may be avoided by moving it into preset
tables
  \begin{align*}
    \mbox{MulWordByXPowD}_k(n) = \mbox{MulWordByXPowD}(n) \cdot x^{kS} \bmod P(x).
  \end{align*}
that are used to compute $crc'_k(x) = \mbox{crc}_k(x) \cdot x^{8kS}$, so that
  \begin{align*}
    \mbox{crc}_0(x) = \sum_{k = 0}^{N-1} crc'_k.
  \end{align*}

Unfortunately, this approach alone does not help because
\begin{enumerate}
\item
  It increases the memory footprint of MulWordByXPowD by factor of $N$.
  Once the cumulative size of $\mbox{MulWordByXPowD}_k$ tables exceeds the
  size of L1 cache (32-64KB), the cost of memory access to multiplication
  table data increases from 3-4 CPU cycles to 15-20, eliminating all
  performance gains achieved by reducing the number of table operations.
\item
  It is still necessary to combine all $N$ values of $\mbox{crc}_k$ into
  $\mbox{crc}_0$ at the end of the CRC computation.
\end{enumerate}


% -----------------------------------------
\subsection{Interleaved word-by-word CRC} \label{s:multiword}

% -----------------------------------------
\subsubsection{Parallelizing CRC computation} \label{s:parallelizing}

Assume that input message $M$ is the concatenation of $K$ groups $g_k$, and
each group $g_k$ is concatenation of $N$ $W$-bit long words:
  \begin{align*}
    M(x) &= \sum_{k=0}^{K-1} g_k(x) \cdot x^{(K-1-k)NW}, \\
    g_k(x) &= \sum_{n=0}^{N-1} m_{k, n} \cdot x^{(N-1-n)W}.
  \end{align*}

Input message $M(x)$ may be represented as
  \begin{align}
    M(x)
      &= \sum_{k=0}^{K-1} g_k(x) \cdot x^{(K-1-k)NW} = \nonumber \\
      &= \sum_{k=0}^{K-1} \left(\sum_{n=0}^{N-1} m_{k, n} \cdot x^{(N-1-n)W} \right) \cdot x^{(K-1-k)NW} = \nonumber \\
      &= \sum_{n=0}^{N-1} \left(\sum_{k=0}^{K-1} m_{k, n} \cdot x^{(K-1-k)NW}\right) \cdot x^{(N-1-n)W} = \nonumber \\
      &= \sum_{n=0}^{N-1} M_n(x) \cdot x^{(N-1-n)W} \label{e:splitbyword}
  \end{align}
where
  \begin{align*}
    M_n(x)
      &= \sum_{k=0}^{K-1} m_{k, n} \cdot x^{(K-1-k)NW}.
  \end{align*}

In other words, $M_n$ is concatenation of $n$-th $W$-bit word from $g_0$
followed by $(N-1)W$ zero bits, then $n$-th word from $g_1$ followed by
$(N-1)W$ zero bits, etc., ending up with $n$-th word from last group
$g_{K-1}$.

Appending $(N-1)W$ zero bits to $M_n$ yields $M'_n(x) = M_n(x) \cdot
x^{(N-1)W} $ which may be viewed as the concatenation of $K$ $NW$-bit
groups $f_{k}$:
  \begin{align*}
    M'_n(x)
      &= M_n(x) \cdot x^{(N-1)W}
      = \sum_{k=0}^{K-1} f_{k, n} \cdot x^{(K-1-k)NW}, \\
    f_{k, n}(x) &= m_{k, n}(x) \cdot x^{(N-1)W}, \\
  \end{align*}
so
  \begin{align}
    M(x)
      &= \sum_{n=0}^{N-1} M_n(x) \cdot x^{(N-1-n)W} \nonumber \\
      &= \sum_{n=0}^{N-1} M'_n(x) \cdot x^{-(N-1)W} \cdot x^{(N-1-n)W} \nonumber \\
      &= \sum_{n=0}^{N-1} M'_n(x) \cdot x^{-nW}. \label{e:mdash}
  \end{align}

According to (\ref{e:incremental}),
  $v_{K, n}(x) = \SCRC\bigl(M'_n(x), v_{0, n}(x)\bigr)$
may be computed incrementally:
  \begin{align}
    v_{k+1, n}(x)
      &= \SCRC\bigl(f_{k, n}(x), v_{k, n}(x)\bigr) = \nonumber \\
      &= \SCRC\bigl(m_{k, n}(x) \cdot x^{(N-1)W}, v_{k, n}(x)\bigr) = \nonumber \\
      &= \Bigl(v_{k, n}(x) \cdot x^{NW} + m_{k, n}(x) \cdot x^{(N-1)W} \cdot x^D \Bigr) \bmod P(x) = \nonumber \\
      &= \Bigl(v_{k, n}(x) \cdot x^W + m_{k, n}(x) \cdot x^D \Bigr) \cdot x^{(N-1)W} \bmod P(x) = \label{e:crcwordnmultiply} \\
      &= \mbox{CrcWordN}\bigl(m_{k, n}(x), v_{k, n}(x)\bigr). \label{e:crcwordn}
  \end{align}

This approach:
\begin{enumerate}
\item
  Creates $N$ independent data flows: computation of $v_{k, 0}, \ldots,
  v_{k, N-1}$ may be performed truly in parallel. There are no contentions
  on a single data source or destination like those the word-by-word CRC
  computation described in section \ref{s:crcword} suffered from.
\item
  Input data is accessed sequentially. Therefore, the load on cache
  subsystem and false cache collisions are  minimal. Thus, the performance
  bottlenecks of approach described in \ref{s:blockword} are eliminated.
\end{enumerate}


% -----------------------------------------
\subsubsection{Combining individual CRCs} \label{s:combine}

Once $v_{K, n}(x) = \SCRC\bigl(M'_n(x), v_{0, n}(x)\bigr)$ are computed starting with
  \begin{align*}
    v_{0, 0} &= v(x), \\
    v_{0, n} &= 0, n \geq 1,
  \end{align*}
by definition (\ref{e:simplifiedcrc}) of $\SCRC$ and relationship (\ref{e:mdash}),
  \begin{align}
    \SCRC\bigl(M(x), v(x)\bigr)
      &= \SCRC\left(\sum_{n=0}^{N-1} M'_n(x) \cdot x^{-nW}, v(x)\right) = \nonumber \\
      &= \sum_{n=0}^{N-1} \SCRC\bigl(M'_n(x) \cdot x^{-nW}, v_{0, n}(x) \bigr) = \nonumber \\
      &= \sum_{n=0}^{N-1} \SCRC\bigl(M'_n(x), v_{0, n}(x) \bigr) \cdot x^{-nW} = \nonumber \\
      &= \sum_{n=0}^{N-1} v_{K, n}(x) \cdot x^{-nW}. \label{e:multiwordcrc1}
  \end{align}

Even though this step is performed only once per input message, it still
requires $(N-1)$ non-trivial multiplications modulo $P(x)$ negatively
affecting the performance on small input messages. Also,
(\ref{e:multiwordcrc1}) uses the multiplicative inverse of $x^{nW}$ modulo
$P(x)$ which does not exists when $P(x) \bmod x = 0$.

There is more efficient and elegant solution. Assume that $M(x)$ is
followed by one more group $g_K(x)$. Then
  \begin{align}
    & \SCRC\bigl(M(x) \cdot x^{NW} + g_K(x), v(x)\bigr) = \nonumber \\
    & = \SCRC\Bigl(g_K(x), \SCRC\bigl(M(x), v(x)\bigr)\Bigr) = \nonumber \\
    & = \Bigl(\SCRC\bigl(M(x), v(x)\bigr) \cdot x^{NW} + g_K(x) \cdot x^D \Bigr) \bmod P(x) = \nonumber \\
    & = \left(x^{NW} \sum_{n=0}^{N-1} v_{K, n}(x) \cdot x^{-nW} + x^D \sum_{n=0}^{N-1} m_{K, n}(x) \cdot x^{(N-1-n)W} \right) \bmod P(x) = \nonumber \\
    & = \left(x^{W} \sum_{n=0}^{N-1} v_{K, n}(x) \cdot x^{(N-1-n)W} + x^D \sum_{n=0}^{N-1} m_{K, n}(x) \cdot x^{(N-1-n)W} \right) \bmod P(x) \nonumber \\
    & = \sum_{n=0}^{N-1} \Bigl( v_{K, n}(x) \cdot x^{W} + m_{K, n}(x) \cdot x^D\Bigr) \cdot x^{(N-1-n)W} \bmod P(x) = \label{e:additionalmultiply} \\
    & = \sum_{n=0}^{N-1} \SCRC\bigl(m_{K, n}(x), v_{K, n}(x)\bigr) \cdot x^{(N-1-n)W} \bmod P(x). \label{e:additionalmultiply2}
  \end{align}

(\ref{e:additionalmultiply2}) may be implemented using formula
(\ref{e:crcwordtable}) by setting $v'_0 = 0$, and then for $n = 0, \ldots,
N-1$ computing
  \begin{align*}
    v'_{n+1}(x)
      &= \Bigl(\bigl(v'_n(x) + v_{K, n}\bigr) \cdot x^W + m_{K, n} \cdot x^D\Bigr) \bmod P(x) \\
      &= \SCRC\bigl(m_{K,n}, v'_n(x) + v_{K, n} \bigr).
  \end{align*}

Alternatively, this step may be performed using the less efficient
technique described in section \ref{s:crcbyteword}.


% -----------------------------------------
\subsubsection{Efficient computation of individual CRCs} \label{s:compute}

Given $v(x)$, $\deg\bigl(v(x)\bigr) < D$ and $m(x)$, $\deg\bigl(m(x)\bigr) < W$,
\begin{align*}
    \mbox{CrcWordN}\bigl(m(x), v(x)\bigr)
    &= \Bigl(v(x) \cdot x^W + m(x) \cdot x^D \Bigr) \cdot x^{(N-1)W} \bmod P(x)
\end{align*}
may be implemented efficiently utilizing the techniques described in
sections \ref{s:crcbyte}, \ref{s:crcbyteword}, and \ref{s:crcword}. When $D
\leq W$,
  \begin{align*}
    \mbox{CrcWordN}\bigl(m(x), v(x)\bigr)
    &= \Bigl(v(x) \cdot x^W + m(x) \cdot x^D \Bigr) \cdot x^{(N-1)W} \bmod P(x) = \\
    &= \Bigl(v(x) \cdot x^{W-D} + m(x) \Bigr) \cdot x^{(N-1)W + D} \bmod P(x),
  \end{align*}
and may be implemented using the table-driven multiplication as described
in (\ref{e:crcwordtable}) except that the operand is multiplied by
$x^{(N-1)W+D}$ instead of $x^D$. Like in (\ref{e:crcbytetable2}), explicit
multiplication of $v(x)$ by $x^{W-D}$ is not required since $D$-normalized
representation of $v(x)$, viewed as a $W$-normalized representation, is
equal to $\left(v(x) \cdot x^{W-D}\right)$.

Using the same technique as in formula (\ref{e:crcbyte}), for $D \geq W$
let
  \begin{align*}
    v_H(x) &= \left\lfloor\frac{v(x)}{x^{D-W}}\right\rfloor,
        & \deg\bigl(v_H(x)\bigr) &< W, \\
    v_L(x) &= v(x) \bmod x^{D-W},
        & \deg\bigl(v_L(x)\bigr) &< D-W,
  \end{align*}
so that $v(x) = v_L(x) + v_H(x) \cdot x^{D-W}$. Then,
  \begin{align}
    & \mbox{CrcWordN}\bigl(m(x), v(x)\bigr) = \nonumber \\
    & = \Bigl(v(x) \cdot x^W + m(x) \cdot x^D \Bigr) \cdot x^{(N-1)W} \bmod P(x) = \nonumber \\
    & = \Bigl(\bigl(v_L(x) + v_H(x) \cdot x^{D-W}\bigr) \cdot x^W + m(x) \cdot x^D \Bigr) \cdot x^{(N-1)W} \bmod P(x) = \nonumber \\
    & = \Bigl(v_L(x) \cdot x^W + \bigl(v_H(x) + m(x)\bigr) \cdot x^D \Bigr) \cdot x^{(N-1)W} \bmod P(x) = \nonumber \\
    & = \Bigl(\bigl(v_H(x) + m(x)\bigr) \cdot x^{(N-1)W + D} \bmod P(x)\Bigr) + \nonumber \\
    &   + \Bigl(\bigl(v_L(x) \cdot x^W \bigr) \cdot x^{(N-1)W} \bmod P(x)\Bigr). \label{e:crcwordinterleaved}
  \end{align}

Since $\deg\bigl(v_H(x)+m(x)\bigr) < W$, the first summand of
  $\mbox{CrcWordN}\bigl(m(x), v(x)\bigr)$,
\begin{align*}
  \Bigl(\bigl(v_H(x) + m(x)\bigr) \cdot x^{(N-1)W + D} \bmod P(x)\Bigr),
\end{align*}
may be computed using the table-driven multiplication technique described
in (\ref{e:crcwordtable}) except that the operand is multiplied by
$x^{D+(N-1)W}$ instead of $x^D$.

Computation of the second summand of $\mbox{CrcWordN}\bigl(m(x), v(x)\bigr)$,
  \begin{align*}
    \Bigl(\bigl(v_L(x) \cdot x^W \bigr) \cdot x^{(N-1)W} \bmod P(x)\Bigr),
  \end{align*}
is somewhat less intuitive. Since $\deg\bigl(v_L(x)\bigr) < D-W$,
  \begin{align*}
    \left(v_L(x) \cdot x^W\right) \bmod P(x) = \left(v_L(x) \cdot x^W\right),
  \end{align*}
and may be computed by shifting $v_L(x)$ by $W$ bits. Additional
multiplication by $x^{(N-1)W}$ is accomplished by adding $\bigl(v_L(x)
\cdot x^W\bigr)$, produced at step $n < N-1$ of the algorithm described by
formula (\ref{e:crcwordn}), to the value of $v_{k, n+1}(x)$ which will be
additionally multiplied by $x^{(N-1)W}$ as shown in formula
(\ref{e:crcwordnmultiply}).

For $n=N-1$, the value of $\bigl(v_L(x) \cdot x^W\bigr)$ should be added to
the value of $v_{k+1, n'}(x)$ where $n' = 0$. For $k < K$, it will be
multiplied by $x^{(N-1)W}$ during next round of parallel computation as
shown in (\ref{e:crcwordnmultiply}). For $k = K$, $v_{k+1, n'}(x)$ will be
multiplied by $x^{(N-1)W}$ during CRC concatenation as shown in
(\ref{e:additionalmultiply}) since $n'=0$.

\begin{figure}
\begin{lstlisting}[caption={Interleaved, word by word CRC computation},label={l:CrcMultiword}]
Crc CrcInterleavedWordByWord(
    Word *data, int blocks, Crc v, Crc u) {
  Crc crc[N+1] = {0};
  crc[0] = v ^ u;
  for (int i = 0; i < N*(blocks - 1); i += N) {
    Word buffer[N];
    // Load next N words and move overflow
    // bits into "next" word.
    for (int n = 0; n < N; ++n) {
      buffer[N] = crc[n] ^ data[i + n];
      if (D > sizeof(Word) * 8)
        crc[n+1] ^= crc[n] >> (sizeof(Word) * 8);
      crc[n] = 0;
    }
    // Compute interleaved word-by-word CRC.
    for (int byte = 0; byte < sizeof(Word); ++byte) {
      for (int n = 0; n < N; ++n) {
        crc[n] ^=
            MulInterleavedWordByXpowD[byte][(Byte) buffer[n]];
        buffer[n] >>= 8;
      }
    }
    // Combine crc[0] with delayed overflow bits.
    crc[0] ^= crc[N];
    crc[N] = 0;
  }
  // Process the last N bytes and combine CRCs.
  for (int n = 0; n < N; ++n) {
    if (n != 0) crc[0] ^= crc[n];
    Crc WordCrc = CrcOfWord(crc[0] ^ data[i + n]);
    if (D > sizeof(Word) * 8) {
      crc[0] >>= D - sizeof(Word) * 8;
      crc[0] ^= WordCrc;
    } else {
      crc[0] = WordCrc;
    }
  }
  return (crc[0] ^ u);
}
void InitInterleavedWordTables(void) {
  for (int byte = 0; byte < sizeof(Word); ++byte) {
    Crc m = XpowN(D - 8 + N*sizeof(Word)*8 - 8*byte);
    for (int i = 0; i < 256; ++i) {
      MulInterleavedWordByXpowD[byte][i] =
          MultiplyUnnormalized(i, 8, m);
    }
  }
}
\end{lstlisting}
\end{figure}


% -----------------------------------------
\section{Experimental results}

The tests were performed using Intel Q9650 3.0GHz CPU, DDR2-800 memory with
4-4-4-12 timing, and a motherboard with an Intel P45 chipset.


% -----------------------------------------
\subsection{Testing methology}

All tests were performed using random input data over various block sizes.
The code for all evaluated algorithms was heavily optimized. Tests were
performed on both aligned and non-aligned input data to ensure that
misaligned inputs do not carry performance penalty. CRC tables were aligned
on 256-byte boundary.

Tests were performed with warm data and warm CRC tables: as shown in
\cite{Kounavis2005\remove{, DBLP:conf/iscc/KounavisB05,
DBLP:journals/tc/KounavisB08}}, the footprint of CRC tables -- as long as
they fit into L1 cache -- is not a major contributor to the performance.

Performance was measured in number of CPU cycles per byte of input data:
apparently, performance of CRC computation is bounded by performance of CPU
and its L1 cache latency. Spot testing of few other Intel and AMD CPU
models showed little variation in performance measured in CPU cycles per
byte despite substential differences in CPU clock frequencies.

To minimize performance variations caused by interference with OS and other
applications (context switches, CPU migrations, CPU cache flushes, memory
bus interference from other processes, etc.), the test applications were
run at high priority, each test was executed multiple times, and the
minimum time was measured. That allowed the tests to achieve repeatability
within $\pm 1\%$.


% -----------------------------------------
\subsection{Compiler comparison}

Despite CRC code being rather straightforward, there were surprises (see
tables \ref{t:CompilerComparison128} and \ref{t:CompilerComparison64}).

On 64-bit AMD64 platform, Microsoft CL compiler (version 15.00.30729)
consistently and noticeably generated the fastest code using
general-purpose integer arithmetics (64-bit and smaller CRCs) -- 1.23 times
faster than the code generated by Intel's ICL 11.10.051 and 1.49 times
faster than the code generated by GCC 4.5.0. A tuned, hand-written inline
assembler code for CRC-32 and CRC-64 for GCC was as fast as the code
generated by CL.

When it comes to arithmetics with the use of SSE2 intrinsic functions on
64-bit AMD64 platform for 128-bit CRC, the code generated by GCC 4.5.0
consistenly outperformed the code generated by Microsoft and Intel
compilers -- by a factor of 1.21 and 1.33 respectively. However, earlier
versions of GCC did not produce efficient SSE2 code either. For that
reason, pre-4.5.0 versions of GCC use hand-written inline assember code
which was as fast as the code generated by GCC 4.5.0.

Neither compiler was able to generate efficient code on 32-bit bit I386
platform. Performance of the code that used MMX intrinsic functions was
better but still not as good as hand-written assember, which was provided
for all compilers.

The fastest code for 128-bit CRC on I386 platform was generated by GCC
4.5.0.


% -----------------------------------------
\subsection{Choice of interleave level}

Number of data streams processed by interleaved, word-by-word CRC
computation described in section \ref{s:multiword} should matter. Too few
means underutilization of available ALUs. Too many will increase the length
of the main loop and stress instruction decoders, and may cause splilling
of registers containing hot data (interleaved processing of $N$ words of
data uses at least $(2N+2)$ registers).

As table \ref{t:MultiwordPerfByStripe} shows, the optimal number of
interleaved data streams on modern Intel and AMD CPUs for integer
arithmetics is either 3 or 4 (likely because they all have exactly 3 ALUs).
However, for SSE2 arithmetics on AMD64 platform the optimal number of
streams is 6 (3 on I386), which is quite counter-intuitive result as it does
not correlate with the number of available ALUs. Good old performance
mantra "you need to measure" still applies.


% -----------------------------------------
\subsection{Performance of CRC algorithms}

Average performance of best variants of CRC algorithms for 64-bit AMD64 and
32-bit I386 platforms processing 1KB, 2KB, \ldots, 1MB inputs is given in
tables \ref{t:AveragePerformance64} and \ref{t:AveragePerformance64}
respectively. Proposed interleaved multiword CRC algorithm is 1.7-2.0 times
faster that current state of the art ``slicing''.

As demonstrated in tables \ref{t:CRC64Perf} and \ref{t:CRC32Perf},
interleaved word-by-word CRC described in section \ref{s:multiword},
running at 1.2 CPU cycles/byte, is 1.8 times faster than 2.1 CPU
cycles/byte achieved by current state of the art word-by-word CRC algorithm
(``slicing") described in \cite{Kounavis2005\remove{,
DBLP:conf/iscc/KounavisB05, DBLP:journals/tc/KounavisB08}}.

On 64-bit AMD64 platform, the best performance was achieved using 64-bit
reads and 64-bit tables for all variants of $N$-bit CRC for $N \leq 64$. In
particular, tables \ref{t:CRC64Perf} and \ref{t:CRC32Perf} clearly show
that performance of 32-bit and 64-bit CRCs is nearly identical.
Consequently, there is no reason to favor CRC-32 over CRC-64 for
performance reasons.

The use of MMX on the 32-bit I386 platform allowed to utilize 64-bit tables
and 64-bit reads achieving 1.3 CPU cyles/byte. Neither compiler generated
efficient code using MMX intrinsic functions, so inline assembler was used.

With the use of SSE2 intrinsics on AMD64 architecture, 128-bit CRC may be
computed takes at 1.7 CPU cycles/byte using the new algorithm (see
table \ref{t:CRC128PerfMultiword}), compared with 2.9 CPU cycles/byte
achieved by word-by-word CRC computation (see table
\ref{t:CRC128PerfSlicing}). On the 32-bit I386 architecture, the use of SSE2
intrinsics and GCC 4.5.0 allowed the computation of 128-bit CRC at 2.1 CPU
cycles/byte, compared with 4.2 CPU cycles/byte delivered by
word-by-word algorithm.

Given that MD5 computation takes 6.8-7.1 CPU cycles/byte and SHA-1 takes
7.6-7.9 CPU cycles per byte, CRCs are still the algorithm of choice for
data corruption detection.


% -----------------------------------------
\bibliographystyle{alpha}
\bibliography{crc}

% -----------------------------------------
\appendix
\cleardoublepage


% -----------------------------------------
\begin{table}
\begin{center}

\caption{CRC performance, AMD64 platform} \label{t:AveragePerformance64}
\begin{tabular}{| l | c | c | c |}
  \hline
Method               & Slicing$^1$ & Multiword$^2$ & Improvement \\
  \hline
              CRC-32 &  $2.08^3$   &  $1.16^{4,5}$ & 1.79    \\
              CRC-64 &  $2.09^3$   &  $1.16^{4,5}$ & 1.79    \\
             CRC-128 &  $2.91^4$   &  $1.68^{4,6}$ & 1.73    \\
  \hline
\end{tabular}
{}

\caption{CRC performance, I386 platform} \label{t:AveragePerformance32}
\begin{tabular}{| l | c | c | c |}
  \hline
Method               & Slicing$^1$ & Multiword$^2$ & Improvement \\
  \hline
              CRC-32 &  $2.52^3$   &  $1.29^{3,7}$ & 1.96   \\
              CRC-64 &  $3.28^3$   &  $1.29^{3,7}$ & 2.55   \\
             CRC-128 &  $4.17^4$   &  $2.10^{4,8}$ & 1.98   \\
  \hline
\end{tabular}
{}
\end{center}


The best average number of CPU cycles per byte processing 1KB-1MB inputs.
Warm data, warm tables.

$^1$ {\it``Slicing"} implements the algorithm described in section
\ref{s:crcword}.

$^2$ {\it``Multiword/$N$"} implements algorithm described in section
\ref{s:multiword} processing $N$ data streams in parallel in interleaved
manner.

$^3$ Microsoft CL 15.00.30729 compiler, ``-O2" flag.

$^4$ GCC 4.5.0 compiler, ``-O3" flag.

$^5$ Multiword/$N=4$, hand-written inline assembler.

$^6$ Multiword/$N=6$, C++.

$^7$ Multiword/$N=4$, hand-written MMX inline assember.

$^8$ Multiword/$N=3$, C++.

\end{table}


% --------------------------------------
\begin{table}
\begin{center}
\caption{Interleaved multiword CRC: choosing the number of stripes $N$}

\label{t:MultiwordPerfByStripe}
\begin{tabular}{| l | l | c | c | c | c | c | c | c |}
  \hline
    CRC             & Platform & N=2   & N=3   & N=4   & N=5   & N=6   & N=7   & N=8      \\
  \hline
            CRC-64$^9$ & AMD64 &  1.42 &  1.23 &  {\bf 1.17} &  1.46 &  2.08 &  2.59 &  2.73    \\
        CRC-128$^{10}$ & AMD64 &  2.07 &  1.84 &  1.76 &  1.70 &  {\bf 1.68} &  1.75 &  1.79    \\
        CRC-128$^{10}$ &  I386 &  2.56 &  {\bf 2.10} &  2.46 &  2.61 &  2.52 &  2.62 &  2.57    \\
  \hline
\end{tabular}
\end{center}
{}

Average number of CPU cycles per byte processing 1KB, 2KB, \ldots, 1MB
inputs. Interleaved word-by-word CRC computation as described in section
\ref{s:multiword}. Warm data, warm tables.

$^9$ Microsoft CL 15.00.30729 compiler, AMD64 platform, C++ code.

$^{10}$ GCC 4.5.0 compiler, AMD64 platform, C++ code.

\end{table}


% -----------------------------------------
\begin{table}
\begin{center}

\caption{Compiler comparison: Multiword/N, 64-bit CRC} \label{t:CompilerComparison64}
\begin{tabular}{| l | c | c | c | c | c | c | c | c | c |}
  \hline
Input size        & N & 64   & 256   & 1K    & 4K    & 16K   & 64K   & 256K  & 1M       \\
  \hline
          GCC/C++ & 3 & 2.11 &  1.84 &  1.76 &  1.74 &  1.75 &  1.75 &  1.75 &  1.76    \\
              ICL & 3 & 2.35 &  1.65 &  1.48 &  1.44 &  1.44 &  1.45 &  1.45 &  1.45    \\
               CL & 4 & 1.75 &  1.29 &  1.18 &  1.15 &  1.17 &  1.18 &  1.18 &  1.18    \\
          GCC/ASM & 4 & 1.65 &  1.26 &  1.17 &  1.15 &  1.16 &  1.17 &  1.17 &  1.17    \\
  \hline
\end{tabular}
{}

\caption{Compiler comparison: Multiword/N, 128-bit CRC} \label{t:CompilerComparison128}
\begin{tabular}{| l | c | c | c | c | c | c | c | c | c |}
  \hline
Input size       & N & 64    & 256   & 1K    & 4K    & 16K   & 64K   & 256K  & 1M       \\
  \hline
              CL & 5 &  4.08 &  2.56 &  2.43 &  2.25 &  2.20 &  2.19 &  2.18 &  2.20    \\
             ICL & 5 &  3.52 &  2.33 &  2.23 &  2.05 &  2.00 &  1.99 &  1.99 &  2.01    \\
             GCC & 6 &  2.90 &  1.93 &  1.85 &  1.72 &  1.65 &  1.63 &  1.63 &  1.63    \\
  \hline
\end{tabular}
{}
\end{center}

Number of CPU cycles per byte, best code for given compiler and CRC.

64-bit CRC (CRC-64-ECMA-182 polynomial) and 128-bit CRC (CRC-128/IEEE
polynomial) respectively. 64-bit platform, 64-bit reads. Warm data, warm
tables.

Microsoft CL 15.00.30729 compiler was used with ``-O2" flag. Intel ICL
11.10.051 and GCC 4.5.0 were used with ``-O3" flag.

\begin{center}
\includegraphics[trim=14.25mm 50mm 16.75mm 50mm, width=0.99\textwidth]{CompilerComparison.pdf} \label{f:CompilerComparison}
\end{center}

{\it``Multiword/$N$"} implements algorithm described in section
\ref{s:multiword} processing $N$ data streams in parallel in interleaved
manner.


\end{table}


% --------------------------------------
\begin{table}
\begin{center}

\caption{CRC-32 performance} \label{t:CRC32Perf}
\begin{tabular}{| l | c | c | c | c | c | c | c | c |}
  \hline
Input size           & 64    & 256   & 1K    & 4K    & 16K   & 64K   & 256K  & 1M       \\
  \hline
             Sarwate &  6.61 &  6.62 &  6.70 &  6.68 &  6.67 &  6.66 &  6.67 &  6.75    \\
               Black &  5.44 &  5.46 &  5.47 &  5.48 &  5.47 &  5.46 &  5.47 &  5.53    \\
             Slicing &  2.15 &  2.10 &  2.09 &  2.09 &  2.08 &  2.08 &  2.08 &  2.10    \\
         Blockword/3 &  2.27 &  2.14 &  2.15 &  2.13 &  2.13 &  1.55 &  1.39 &  1.31    \\
         Multiword/4 &  1.75 &  1.29 &  1.18 &  1.16 &  1.17 &  1.18 &  1.18 &  1.18    \\
  \hline
\end{tabular}
\end{center}

Number of CPU cycles per byte. 32-bit CRC (CRC-32C polynomial), 64-bit
platform, 64-bit tables, 64-bit reads (except Sarwate). Microsoft CL
15.00.30729 compiler. Warm data, warm tables.

\begin{center}
\includegraphics[trim=14.25mm 50mm 16.75mm 50mm, width=0.99\textwidth]{CRC32-full.pdf} \label{f:CRC32Perf}
\end{center}

{\it``Sarwate"} implements the algorithm described in section
\ref{s:crcbyte}.

{\it``Black"} implements the algorithm described in section
\ref{s:crcbyteword}.

{\it``Slicing"} implements the algorithm described in section
\ref{s:crcword}.

{\it``Blockword/3"} implements the algorithm described in section
\ref{s:blockword} with 3 stripes of 15,376 bytes each.

{\it``Multiword/4"} implements the algorithm described in section
\ref{s:multiword} processing 4 data streams in parallel in interleaved
manner.


\end{table}


% --------------------------------------
\begin{table}
\begin{center}

\caption{CRC-64 performance} \label{t:CRC64Perf}
\begin{tabular}{| l | c | c | c | c | c | c | c | c |}
  \hline
Input size           & 64    & 256   & 1K    & 4K    & 16K   & 64K   & 256K  & 1M       \\
  \hline
             Sarwate &  6.61 &  6.62 &  6.70 &  6.68 &  6.67 &  6.65 &  6.66 &  6.75    \\
               Black &  5.44 &  5.46 &  5.47 &  5.47 &  5.47 &  5.47 &  5.47 &  5.53    \\
             Slicing &  2.16 &  2.08 &  2.09 &  2.10 &  2.08 &  2.08 &  2.08 &  2.09    \\
         Blockword/3 &  2.27 &  2.14 &  2.15 &  2.13 &  2.13 &  1.59 &  1.41 &  1.33    \\
         Multiword/4 &  1.75 &  1.29 &  1.18 &  1.15 &  1.17 &  1.18 &  1.18 &  1.18    \\
  \hline
\end{tabular}
\end{center}

Number of CPU cycles per byte. 64-bit CRC (CRC-64-ECMA-182 polynomial),
64-bit platform, 64-bit tables, 64-bit reads (except Sarwate). Microsoft CL
15.00.30729 compiler. Warm data, warm tables.

\begin{center}
\includegraphics[trim=14.25mm 50mm 16.75mm 50mm, width=0.99\textwidth]{CRC64-small.pdf} \label{f:CRC64Perf}
\end{center}

{\it``Sarwate"} implements the algorithm described in section
\ref{s:crcbyte}.

{\it``Black"} implements the algorithm described in section
\ref{s:crcbyteword}.

{\it``Slicing"} implements the algorithm described in section
\ref{s:crcword}.

{\it``Blockword/3"} implements the algorithm described in section
\ref{s:blockword} with 3 stripes of 15,376 bytes each.

{\it``Multiword/4"} implements the algorithm described in section
\ref{s:multiword} processing 4 data streams in parallel in interleaved
manner.


\end{table}

% --------------------------------------
\begin{table}
\begin{center}


\caption{CRC-128 performance: Slicing CRC} \label{t:CRC128PerfSlicing}
\begin{tabular}{| l | c | c | c | c | c | c | c | c |}
  \hline
Input size   & 64    & 256   & 1K    & 4K    & 16K   & 64K   & 256K  & 1M       \\
  \hline
     CL/SSE2 &  4.02 &  3.81 &  4.01 &  4.05 &  4.13 &  4.18 &  4.20 &  4.24    \\
    ICL/SSE2 &  3.40 &  3.24 &  3.57 &  3.59 &  3.68 &  3.72 &  3.75 &  3.81    \\
    GCC/UINT &  3.45 &  3.24 &  3.36 &  3.48 &  3.61 &  3.64 &  3.67 &  3.72    \\
    GCC/SSE2 &  2.67 &  2.48 &  2.63 &  2.79 &  2.97 &  2.99 &  2.99 &  3.03    \\
  \hline
\end{tabular}


\caption{CRC-128 performance: Multiword CRC} \label{t:CRC128PerfMultiword}
\begin{tabular}{| l | c | c | c | c | c | c | c | c |}
  \hline
Input size & 64    & 256   & 1K    & 4K    & 16K   & 64K   & 256K  & 1M       \\
  \hline
GCC/UINT/3 &  3.83 &  3.02 &  3.04 &  3.01 &  3.00 &  2.98 &  2.98 &  3.00    \\
 CL/SSE2/5 &  4.08 &  2.56 &  2.43 &  2.25 &  2.20 &  2.19 &  2.18 &  2.20    \\
ICL/SSE2/5 &  3.52 &  2.33 &  2.23 &  2.05 &  2.00 &  1.99 &  1.99 &  2.01    \\
GCC/SSE2/6 &  2.90 &  1.93 &  1.85 &  1.72 &  1.65 &  1.63 &  1.63 &  1.63    \\
  \hline
\end{tabular}
\end{center}

Number of CPU cycles per byte. 128-bit CRC (CRC-128/IEEE polynomial),
64-bit platform, 128-bit tables, 64-bit reads. Warm data, warm tables.

All compilers were tested using SSE2 intrinsics (/SSE2 variants). GCC was
also tested using 128-bit integers provided by the compiler (GCC/UINT).

\begin{center}
\includegraphics[trim=14.25mm 50mm 16.75mm 50mm, width=0.99\textwidth]{CRC128-full.pdf} \label{f:CRC128Perf}
\end{center}

{\it``Slicing"} implements algorithm described in section \ref{s:crcword}.

{\it``Multiword/$N$"} implements algorithm described in section
\ref{s:multiword} processing $N$ data streams in parallel in interleaved
manner. The optimal (for given compiler) value of $N$ was used.

\end{table}

\end{document}