\documentclass{article}
|
|
|
|
\usepackage{listings}
|
|
\usepackage{amsmath}
|
|
\usepackage{graphicx}
|
|
\usepackage{url}
|
|
\usepackage{authblk}
|
|
\usepackage{array} \setlength{\extrarowheight}{1.5pt}
|
|
|
|
|
|
\begin{document}
|
|
|
|
\lstset{
|
|
%language=C++, % choose the language of the code
|
|
basicstyle=\footnotesize, % code font size
|
|
numbers=left, % where to put the line-numbers
|
|
numberstyle=\footnotesize, % line number font size
|
|
stepnumber=1, % the step between two line-numbers.
|
|
%backgroundcolor=\color{white}, % choose the background color
|
|
frame=single,
|
|
framerule=1pt,
|
|
captionpos=b, % t or b
|
|
showstringspaces=false, % underline spaces within strings
|
|
showspaces=false, % show spaces within strings with underscores
|
|
showtabs=false, % show tabs within strings with underscores
|
|
breaklines=true % Break long lines of code
|
|
}
|
|
|
|
\def\CRC{{\rm CRC}_u}
|
|
\def\SCRC{{\rm CRC}_0}
|
|
\def\BYTE{{\rm BYTE}}
|
|
\def\LCD{{\rm LCD}}
|
|
\def\CrcWord{{\rm CrcWord}}
|
|
|
|
\def\remove#1{}
|
|
|
|
% -----------------------------------------
|
|
\title{Everything we know about CRC but afraid to forget}
|
|
\author[1]{Andrew Kadatch}
|
|
\affil[1]{Google Inc.}
|
|
\author[2]{Bob Jenkins}
|
|
\affil[2]{Microsoft Corporation}
|
|
\maketitle
|
|
|
|
\begin{abstract}
|
|
This paper describes a novel interleaved, parallelizable word-by-word
|
|
CRC computation algorithm which computes $N$-bit CRC ($N \leq 64$) on
|
|
modern Intel and AMD processors in 1.2 CPU cycles per byte, improving
|
|
state of the art over word-by-word 32-bit and 64-bit CRCs (2.1 CPU
|
|
cycles/byte) and classic byte-by-byte CRC computation (6-7 CPU cycles/byte).
|
|
It computes 128-bit CRC in 1.7 CPU cycles/byte.
|
|
|
|
CRC implementations are heavily optimized and hard to understand. This
|
|
paper describes CRC algorithms as they evolved over time, splitting
|
|
complex optimizations into a sequence of natural improvements.
|
|
|
|
This paper also presents a collection of CRC ``tricks'' that we found
handy on many occasions.
|
|
\end{abstract}
|
|
|
|
\tableofcontents
|
|
|
|
|
|
% -----------------------------------------
|
|
\section{Definition of CRC}
|
|
|
|
Cyclic Redundancy Check (CRC) is a well-known technique that allows the
|
|
recipient of a message transmitted over a noisy channel to detect whether
|
|
the message has been corrupted.
|
|
|
|
A message $M = m_0 \dots m_{N-1}$ comprised of $N=|M|$ bits ($m_k \in \{0,
|
|
1\}$) may be viewed either as a numeric value
|
|
\begin{align*}
|
|
M = \sum_{k=0}^{N-1} m_k 2^{N-1-k}
|
|
\end{align*}
|
|
or as a polynomial of a single variable of degree $(N-1)$
|
|
\begin{align*}
|
|
M(x) = \sum_{k=0}^{N-1} m_k x^{N-1-k}
|
|
\end{align*}
|
|
where $m_k \in GF(2) = \{0, 1\}$ and all arithmetic operations on
|
|
coefficients are performed modulo 2. For example,
|
|
\begin{align*}
|
|
& \mbox{Addition: }
|
|
(x^3+x^2+x+1) + (x^2+x+1) = x^3+2x^2+2x+2 = x^3, \\
|
|
& \mbox{Subtraction: }
|
|
(x^3+x+1) - (x^2+x) = x^3-x^2+1 = x^3+x^2+1, \\
|
|
& \mbox{Multiplication: }
|
|
(x+1)(x+1) = x^2 + 2x + 1 = x^2 + 1.
|
|
\end{align*}
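
The three identities above can be checked mechanically. The following is a minimal sketch rather than part of the implementation described in this paper. It assumes, for this illustration only, that bit $k$ of an unsigned integer holds the coefficient of $x^k$ (the opposite of the register mapping used later), and that the operands are small enough for the product to fit into 32 bits. The names \texttt{AddPoly} and \texttt{MulPoly} are hypothetical.

\begin{lstlisting}[caption={Sketch: arithmetic on polynomials with coefficients from GF(2)}]
#include <cstdint>

// Assumption: bit k holds the coefficient of x**k.
// Addition and subtraction coincide modulo 2.
uint32_t AddPoly(uint32_t a, uint32_t b) {
  return a ^ b;
}

// Shift-and-add multiplication without reduction.
// Assumption: deg(a) + deg(b) < 32.
uint32_t MulPoly(uint32_t a, uint32_t b) {
  uint32_t product = 0;
  for (int k = 0; k < 32; ++k) {
    // If "b" has a term x**k, add (a * x**k).
    if ((b >> k) & 1) product ^= a << k;
  }
  return product;
}
\end{lstlisting}

With this encoding, $(x+1)(x+1)$ is \texttt{MulPoly(0x3, 0x3)}, which yields \texttt{0x5}, that is, $x^2+1$.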
|
|
|
|
For a given polynomial $P(x)$ of degree $D=\deg\bigl(P(x)\bigr)$,
|
|
$\CRC\bigl(M(x),v(x)\bigr)$ is the remainder from division of $\left(M(x)
|
|
\cdot x^D\right)$ by $P(x)$. In practice, a more complex formula is used:
|
|
\begin{align}
|
|
\label{e:crcdefinition}
|
|
\CRC\bigl(M(x), v(x)\bigr)
|
|
= \Bigl(\bigl(v(x)-u(x)\bigr) \cdot x^{|M|} + M(x) \cdot x^D + u(x)\Bigr)
|
|
\bmod P(x),
|
|
\end{align}
|
|
where polynomial $P(x)$ of degree $D$ and polynomial $u(x)$ of degree less
|
|
than $D$ are fixed.
|
|
|
|
The use of the non-zero value of $u(x)$ guarantees that the CRC of a sequence
|
|
of zeroes is different from zero. That allows detection of insertion of
|
|
zeroes in the beginning of a message and replacement of both content of the
|
|
message and its CRC value with zeroes. Typically,
|
|
\begin{align}
|
|
\label{e:constdefinition}
|
|
u(x) &= \sum_{k=0}^{D-1} x^k.
|
|
\end{align}
|
|
|
|
The use of the auxiliary parameter $v(x)$ allows incremental CRC computation
|
|
as shown in section \ref{s:incrementalcrc}.
|
|
|
|
% -----------------------------------------
|
|
\section{Related work}
|
|
|
|
Cyclic Redundancy Checks (CRCs) were proposed by Peterson and Brown
|
|
\cite{Peterson61} in 1961. An efficient table-driven software
|
|
implementation which reads and processes data byte by byte was described by
|
|
Hill \cite{Hill79} in 1979 and by Perez \cite{Perez83} in 1983. The ``classic''
|
|
byte-by-byte CRC algorithm described in section \ref{s:crcbyte} was
|
|
published by Sarwate \cite{DBLP:journals/cacm/Sarwate88} in 1988.
|
|
|
|
In 1993, Black \cite{Black93} published a method that reads data by words
|
|
(described in section \ref{s:crcbyteword}); however, it still computes the
|
|
CRC byte by byte in strictly sequential order.
|
|
|
|
In 2001, Braun and Waldvogel \cite{\remove{%braun01fast,
|
|
}braun01fast-techreport} briefly outlined a specialized variant of a CRC
|
|
that could read input data by words and process them byte by byte -- but,
|
|
thanks to the use of multiple tables, different bytes from the input word
|
|
could be processed in parallel. In 2002, Ji and Killian \cite{JiKillian02}
|
|
provided a detailed description and analysis of a nearly identical scheme.
Both solutions were targeted at hardware implementation. In 2005, Kounavis
|
|
and Berry \cite{Kounavis2005\remove{, DBLP:conf/iscc/KounavisB05,
|
|
DBLP:journals/tc/KounavisB08}} demonstrated clear performance benefits of
|
|
this scheme even when it is implemented in software. A generalized version
|
|
of this approach is described in section \ref{s:crcword}.
|
|
|
|
Surprisingly, until \cite{Gopal2010} we have not seen prior art describing
|
|
or utilizing a method of computing a CRC by processing in parallel (in an
|
|
interleaved manner to utilize multiple ALUs) multiple input streams
|
|
belonging to non-overlapping sections of input data, described in section
|
|
\ref{s:blockword}.
|
|
|
|
A novel method of CRC computation that processes in parallel multiple words
|
|
belonging to overlapping sections of input data is described in section
|
|
\ref{s:multiword}. A special case restricted to the use of 64-bit tables,
|
|
64-bit reads, and 32 or 64-bit generating polynomials was implemented by
|
|
the authors in February-March 2007 and was used by a couple of Microsoft
|
|
products. In 2009, the algorithm was generalized and these limitations were
|
|
removed.
|
|
|
|
The fact that the CRC of a message followed by its CRC is a constant value
|
|
which does not depend on the message, described in section
|
|
\ref{s:storingcrcafter}, is well known and has been widely used in the
|
|
telecommunication industry for a long time.
|
|
|
|
A method of storing a carefully chosen sequence of bits after a message so
|
|
that the CRC of a message and the sequence of bits appended to the message
|
|
produces a predefined result, described in section \ref{s:storingcrcafter}, was
|
|
implemented in 1990 by Zemtsov \cite{Zemtsov90}.
|
|
|
|
A method for recomputing a known CRC using a new initial CRC value,
|
|
described in section \ref{s:changinginitialvalue}, and the method of
|
|
computing a CRC of the concatenation of messages having known CRC values
|
|
without touching the actual data, described in section
|
|
\ref{s:concatenation}, were implemented by one of the authors in 2005 but
|
|
were not published.
|
|
|
|
|
|
% -----------------------------------------
|
|
\section{CRC tricks and tips}
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Incremental CRC computation} \label{s:incrementalcrc}
|
|
|
|
The use of an arbitrary initial CRC value $v(x)$ allows computation of a CRC
|
|
incrementally. If a message
|
|
$M(x) = M_1(x) \cdot x^{|M_2|} + M_2(x)$
|
|
is a concatenation of messages $M_1$ and $M_2$, its CRC may be computed
|
|
piece by piece because
|
|
\begin{align}
|
|
\label{e:incremental}
|
|
\CRC\bigl(M(x), v(x)\bigr)
|
|
&= \CRC\Bigl(M_2(x), \CRC\bigl(M_1(x), v(x)\bigr)\Bigr).
|
|
\end{align}
|
|
|
|
Indeed,
|
|
\begin{align*}
|
|
\CRC(M, v)
|
|
&= \bigl((v-u) x^{|M|} + M x^D + u\bigr) \bmod P = \\
|
|
&= \bigl((v-u) x^{|M_1|+|M_2|} + (M_1 x^{|M_2|} + M_2) x^D + u\bigr) \bmod P = \\
|
|
&= \Bigl(\bigl((v-u) x^{|M_1|} + M_1 x^D \bigr) x^{|M_2|} + M_2 x^D + u\Bigr) \bmod P = \\
|
|
&= \bigl(\CRC(M_1, v) x^{|M_2|} + M_2 x^D + u\bigr) \bmod P = \\
|
|
&= \CRC\bigl(M_2, \CRC(M_1, v)\bigr)
|
|
\end{align*}
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Changing initial CRC value} \label{s:changinginitialvalue}
|
|
|
|
If
|
|
$\CRC\bigl(M(x), v(x)\bigr)$
|
|
for some initial value $v(x)$ is known, it is possible to compute
|
|
$\CRC\bigl(M(x), v'(x)\bigr)$
|
|
for a different initial value $v'(x)$ without touching the value of $M(x)$:
|
|
\begin{align}
|
|
\CRC(M, v')
|
|
&= \CRC(M,v) + \Bigl((v'-v) x^{|M|}\Bigr) \bmod P.
|
|
\label{e:fixv}
|
|
\end{align}
|
|
|
|
Proof:
|
|
\begin{align*}
|
|
\CRC(M, v')
|
|
&= \bigl((v'-u) x^{|M|} + M x^D + u\bigr) \bmod P = \\
|
|
&= \Bigl(\bigl((v'-u)+(v-v)\bigr) x^{|M|} + M x^D + u\Bigr) \bmod P = \\
|
|
&= \Bigl(\bigl((v-u)+(v'-v)\bigr) x^{|M|} + M x^D + u\Bigr) \bmod P = \\
|
|
&= \Bigl(\bigl((v-u) x^{|M|} + M x^D + u\bigr) + (v'-v) x^{|M|}\Bigr) \bmod P = \\
|
|
&= \CRC(M,v) + \Bigl((v'-v) x^{|M|} \bmod P \Bigr).
|
|
\end{align*}
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Concatenation of CRCs} \label{s:concatenation}
|
|
|
|
If a message
|
|
$M(x) = M_1(x) \cdot x^{|M_2|} + M_2(x)$
|
|
is a concatenation of messages $M_1$ and $M_2$, and CRCs of $M_1$, $M_2$
|
|
(computed with some initial values $v_1(x)$, $v_2(x)$ respectively) are
|
|
known,
|
|
$\CRC\bigl(M(x),v(x)\bigr)$
|
|
may be computed without touching contents of the message $M$:
|
|
\begin{enumerate}
|
|
\item
|
|
Using formula (\ref{e:fixv}), the value of $v'_1 = \CRC(M_1,v)$ may
|
|
be computed from the known $\CRC(M_1,v_1)$ without touching the contents of $M_1$.
|
|
\item
|
|
Then, $v'_2 = \CRC(M_2, v'_1)$ may be computed from known $\CRC(M_2,v_2)$
|
|
without touching the contents of $M_2$.
|
|
\end{enumerate}
|
|
According to (\ref{e:incremental}), $\CRC(M,v) = v'_2$.
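
The two steps above can be sketched in code. This is an illustration under stated assumptions, not the authors' implementation: it fixes $D=32$, uses the common CRC-32 generating polynomial in the register mapping described later in this paper (constant \texttt{0xEDB88320}), takes $u(x)$ with all coefficients set, and multiplies by $x^{|M|}$ one bit at a time for simplicity. The helper names and the argument order of \texttt{ChangeStartingValue} are assumptions.

\begin{lstlisting}[caption={Sketch: changing the initial value and concatenating CRCs}]
#include <cstddef>
#include <cstdint>

typedef uint32_t Crc;
const Crc kP = 0xEDB88320u;  // assumed P(x), D = 32
const Crc kU = 0xFFFFFFFFu;  // assumed u(x)

// (a * x) mod P for a D-normalized "a".
Crc MulByX(Crc a) {
  return (a & 1) ? (a >> 1) ^ kP : (a >> 1);
}

// (a * x**bits) mod P, one bit at a time.
Crc MulByXpowN(Crc a, size_t bits) {
  while (bits-- > 0) a = MulByX(a);
  return a;
}

// Reference bit-by-bit CRC(M, v), for checking only.
Crc CrcBitwise(const unsigned char *data, size_t n,
               Crc v) {
  Crc crc = v ^ kU;
  for (size_t i = 0; i < n; ++i) {
    crc ^= data[i];
    crc = MulByXpowN(crc, 8);
  }
  return crc ^ kU;
}

// CRC(M, v_new) from crc = CRC(M, v_old),
// where M is "bytes" bytes long.
Crc ChangeStartingValue(Crc crc, size_t bytes,
                        Crc v_old, Crc v_new) {
  return crc ^ MulByXpowN(v_new ^ v_old, 8 * bytes);
}

// CRC(M1 M2, v) from crc1 = CRC(M1, v) and
// crc2 = CRC(M2, v2); M2 is "bytes2" bytes long.
Crc Concatenate(Crc crc1, Crc crc2, size_t bytes2,
                Crc v2) {
  return ChangeStartingValue(crc2, bytes2, v2, crc1);
}
\end{lstlisting}

Neither function reads message data; only the CRC values and the length of the second message are needed. A production version would replace the bit-at-a-time loop in \texttt{MulByXpowN} with the logarithmic-time method of section \ref{s:mulpown}.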
|
|
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{In-place modification of CRC-ed message} \label{s:replacement}
|
|
|
|
Sometimes it is necessary to replace a part of a message $M(x)$ in place and
recompute the CRC of the modified message efficiently.
|
|
|
|
If a message $M=ABC$ is a concatenation of messages $A$, $B$, and $C$, and
|
|
$B'(x)$ is a new message of the same length as $B(x)$, $\CRC(M')$ of message
|
|
$M'=AB'C$ may be computed from known $\CRC(M)$. Indeed,
|
|
\begin{align*}
|
|
M(x) &= A(x) \cdot x^{|B| + |C|} + B(x) \cdot x^{|C|} + C(x), \\
|
|
M'(x) &= A(x) \cdot x^{|B| + |C|} + B'(x) \cdot x^{|C|} + C(x) = \\
|
|
&= M(x) + \bigl(B'(x) - B(x)\bigr) \cdot x^{|C|},
|
|
\end{align*}
|
|
therefore
|
|
\begin{align*}
|
|
& \CRC\bigl(M'(x),v(x)\bigr) = \\
|
|
&= \CRC\Bigl(M(x) +\bigl(B'(x) - B(x)\bigr) \cdot x^{|C|}, v(x)\Bigr) = \\
&= \Bigl(\bigl(v(x)-u(x)\bigr) x^{|M|} + M(x) x^D + \bigl(B'(x) - B(x)\bigr) x^{|C| + D} + u(x)\Bigr) \bmod P(x) = \\
|
|
&= \Bigl(\CRC\bigl(M(x),v(x)\bigr) + \bigl(B'(x) - B(x)\bigr) x^{|C| + D}\Bigr) \bmod P(x) = \\
|
|
&= \CRC\bigl(M(x),v(x)\bigr) + \Bigl(\bigl(B'(x) - B(x)\bigr) x^{|C| + D} \bmod P(x)\Bigr).
|
|
\end{align*}
|
|
|
|
It is easy to see that
|
|
\begin{align*}
|
|
& \CRC\bigl(B'(x),v(x)\bigr) - \CRC\bigl(B(x),v(x)\bigr) = \\
|
|
&= \bigl(B'(x) - B(x)\bigr) x^{D} \bmod P(x),
|
|
\end{align*}
|
|
so
|
|
\begin{align*}
|
|
& \CRC\bigl(M'(x),v(x)\bigr) = \CRC\bigl(M(x),v(x)\bigr) + \Delta \\
|
|
\end{align*}
|
|
where
|
|
\begin{align*}
|
|
& \Delta = \Bigl(\CRC\bigl(B'(x),v(x)\bigr) - \CRC\bigl(B(x),v(x)\bigr) \Bigr) x^{|C|} \bmod P(x).
|
|
\end{align*}
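
The correction term $\Delta$ can be sketched as follows. The sketch assumes $D=32$, the common CRC-32 generating polynomial (constant \texttt{0xEDB88320} in the register mapping described later), $u(x)$ with all coefficients set, and byte-granular messages. The name \texttt{ReplaceInPlace} and the helpers are hypothetical, and multiplication by $x^{|C|}$ is done one bit at a time for clarity.

\begin{lstlisting}[caption={Sketch: updating a CRC after in-place replacement}]
#include <cstddef>
#include <cstdint>

typedef uint32_t Crc;
const Crc kP = 0xEDB88320u;  // assumed P(x), D = 32
const Crc kU = 0xFFFFFFFFu;  // assumed u(x)

// (a * x**bits) mod P, one bit at a time.
Crc MulByXpowN(Crc a, size_t bits) {
  while (bits-- > 0)
    a = (a & 1) ? (a >> 1) ^ kP : (a >> 1);
  return a;
}

// Reference bit-by-bit CRC(M, v), for checking only.
Crc CrcBitwise(const unsigned char *data, size_t n,
               Crc v) {
  Crc crc = v ^ kU;
  for (size_t i = 0; i < n; ++i) {
    crc ^= data[i];
    crc = MulByXpowN(crc, 8);
  }
  return crc ^ kU;
}

// CRC(AB'C, v) from crc_m = CRC(ABC, v) and the CRCs
// of B and B' (both computed with the same initial
// value); C is "bytes_c" bytes long.
Crc ReplaceInPlace(Crc crc_m, Crc crc_b_old,
                   Crc crc_b_new, size_t bytes_c) {
  Crc delta =
      MulByXpowN(crc_b_old ^ crc_b_new, 8 * bytes_c);
  return crc_m ^ delta;
}
\end{lstlisting}

Only $B$, $B'$ and the length of $C$ are needed; the contents of $A$ and $C$ are never read.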
|
|
|
|
% -----------------------------------------
|
|
\subsection{Storing CRC value after the message} \label{s:storingcrcafter}
|
|
|
|
Often $Q(x) = \CRC\bigl(M(x),v(x)\bigr)$ is padded with zero bits until the
|
|
nearest byte or word boundary and is transmitted as a sequence of $W$ bits
|
|
($W \geq D$) right after the message $M(x)$. This way, the transmitted
|
|
message $T(x)$ is the concatenation of $M(x)$ and $Q(x)$ followed by
|
|
$(W-D)$ zeroes, and is equal to
|
|
\begin{align*}
|
|
T(x) = M(x) \cdot x^W + Q(x) \cdot x^{W-D}.
|
|
\end{align*}
|
|
|
|
According to (\ref{e:crcdefinition}), (\ref{e:incremental}) and taking into
|
|
account that $Q(x)+Q(x) = 0$ since polynomial coefficients are from $GF(2)$,
|
|
$\CRC\bigl(T(x), v(x)\bigr)$ is a constant value which does not depend on
|
|
the contents of the message and is equal to
|
|
\begin{align*}
|
|
& \CRC\bigl(T(x), v(x)\bigr) = \\
|
|
& = \CRC\Bigl(Q(x) \cdot x^{W-D}, \CRC\bigl(M(x), v(x)\bigr)\Bigr) = \\
|
|
& = \CRC\bigl(Q(x) \cdot x^{W-D}, Q(x)\bigr) = \\
|
|
& = \Bigl(\bigl(Q(x)-u(x)\bigr) \cdot x^W + Q(x)\cdot x^{W-D} \cdot x^D + u(x)\Bigr)
|
|
\bmod P(x) = \\
|
|
& = \Bigl(u(x)\left(1 - x^W\right)\Bigr) \bmod P(x).
|
|
\end{align*}
|
|
|
|
A more generic solution is to store a $W$-bit long value after the message
|
|
such that the CRC of the transmitted message is equal to a predefined value
|
|
$R(x)$ (typically $R(x)=0$). The $D$-bit value followed by $(W-D)$ zero
|
|
bits that should be stored after $M(x)$ is
|
|
\begin{align*}
|
|
\hat{q}\bigl(Q(x)\bigr) = \Bigl(\bigl(R(x) - u(x)\bigr) x^{-W} - \bigl(Q(x) - u(x)\bigr)\Bigr) \bmod P(x)
|
|
\end{align*}
|
|
where $x^{-W}$ is the multiplicative inverse of $x^W \bmod P(x)$ which
|
|
exists if $P(x)$ is not divisible by $x$ and may be found by the extended
|
|
Euclidean algorithm \cite{Hasan01}:
|
|
\begin{align*}
|
|
& \CRC\Bigl(\hat{q}\bigl(Q(x)\bigr)x^{W-D}, \CRC\bigl(M(x), v(x)\bigr)\Bigr) = \\
|
|
& = \CRC\Bigl(\hat{q}\bigl(Q(x)\bigr)x^{W-D}, Q(x)\Bigr) = \\
|
|
& = \Bigl(\bigl(Q(x)-u(x)\bigr) \cdot x^W + \hat{q}\bigl(Q(x)\bigr) \cdot x^{W-D} \cdot x^D + u(x)\Bigr)
|
|
\bmod P(x) = \\
|
|
& = R(x).
|
|
\end{align*}
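
The constant-residue property for the simple case $W=D$ can be checked with a short sketch. It assumes $D=W=32$, the common CRC-32 generating polynomial (constant \texttt{0xEDB88320} in the register mapping described later), and $u(x)$ with all coefficients set. Under that mapping the CRC is appended least significant byte first. All names are hypothetical.

\begin{lstlisting}[caption={Sketch: CRC of a message followed by its CRC}]
#include <cstddef>
#include <cstdint>

typedef uint32_t Crc;
const Crc kP = 0xEDB88320u;  // assumed P(x), D = 32
const Crc kU = 0xFFFFFFFFu;  // assumed u(x)

// (a * x**bits) mod P, one bit at a time.
Crc MulByXpowN(Crc a, size_t bits) {
  while (bits-- > 0)
    a = (a & 1) ? (a >> 1) ^ kP : (a >> 1);
  return a;
}

// Reference bit-by-bit CRC(M, v).
Crc CrcBitwise(const unsigned char *data, size_t n,
               Crc v) {
  Crc crc = v ^ kU;
  for (size_t i = 0; i < n; ++i) {
    crc ^= data[i];
    crc = MulByXpowN(crc, 8);
  }
  return crc ^ kU;
}

// CRC(T, v) where T is the message followed by
// Q = CRC(M, v) stored as 4 little-endian bytes.
Crc CrcOfMessagePlusCrc(const unsigned char *data,
                        size_t n, Crc v) {
  Crc q = CrcBitwise(data, n, v);
  unsigned char tail[4];
  for (int i = 0; i < 4; ++i)
    tail[i] = (unsigned char)(q >> (8 * i));
  // Incremental computation: continue from Q.
  return CrcBitwise(tail, 4, q);
}

// The predicted constant (u * (1 - x**W)) mod P.
Crc ExpectedResidue() {
  return kU ^ MulByXpowN(kU, 32);
}
\end{lstlisting}

For any message and any initial value, \texttt{CrcOfMessagePlusCrc} should return the same value as \texttt{ExpectedResidue}, so a receiver can validate a frame by comparing against one constant.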
|
|
|
|
|
|
% -----------------------------------------
|
|
\section{Efficient software implementation}
|
|
|
|
% -----------------------------------------
|
|
\subsection{Mapping bitstreams to hardware registers}
|
|
|
|
For little-endian machines (assumed from now on), the result of loading
a $D$-bit word from memory into a hardware register matches expectations:
|
|
the 0-th bit of the 0-th byte becomes the 0-th (least significant) bit of
|
|
the word corresponding to $x^{(D-1)}$.
|
|
|
|
For example, the 32-bit sequence of 4 bytes 0x01, 0x02, 0x03, 0x04
|
|
(0x04030201 when loaded into a 32-bit hardware register) corresponds to the
|
|
polynomial
|
|
\begin{align*}
|
|
\left(x^{31} + x^{22} + x^{15} + x^{14} + x^{5}\right).
|
|
\end{align*}
|
|
|
|
Addition and subtraction of polynomials with coefficients from $GF(2)$ is
|
|
the bitwise XOR of their coefficients. Multiplication of a polynomial by
|
|
$x$ is achieved by logical right shift of register contents by 1 bit. If a
|
|
shift operation causes a carryover, the resulting polynomial has degree
|
|
$D$.
|
|
|
|
Polynomials of degree less than $D$ whose coefficients are recorded using
|
|
exactly $D$ bits irrespective of actual degree of the polynomial will be
|
|
called {\it $D$-normalized}.
|
|
|
|
Whenever possible -- and unless mentioned explicitly -- all polynomials
|
|
will be represented in $D$-normalized form.
|
|
|
|
Since the generating polynomial $P(x)$ is of degree $D$ and has $(D+1)$
|
|
coefficients, it does not fit into the $D$-bit register. However, its most
|
|
significant coefficient is guaranteed to be 1 and may be represented
|
|
implicitly.
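
The register conventions above are enough to write a bit-by-bit CRC directly from definition (\ref{e:crcdefinition}). The following is a minimal sketch under assumed parameters that the text does not fix: $D=32$, the common CRC-32 generating polynomial stored without its leading coefficient as \texttt{0xEDB88320}, and $u(x)$ with all coefficients set. The names are hypothetical.

\begin{lstlisting}[caption={Sketch: bit-by-bit CRC using the register mapping}]
#include <cstddef>
#include <cstdint>

typedef uint32_t Crc;
// Assumed P(x) with D = 32; the coefficient at x**D
// is implicit, bit 0 holds the coefficient of x**31.
const Crc kP = 0xEDB88320u;
const Crc kU = 0xFFFFFFFFu;  // assumed u(x)

// (a * x) mod P: shift right by one bit; a carried
// out bit means degree D, so subtract (XOR) P.
Crc MulByX(Crc a) {
  return (a & 1) ? (a >> 1) ^ kP : (a >> 1);
}

// CRC(M, v) computed one bit at a time.
Crc CrcBitwise(const unsigned char *data, size_t n,
               Crc v) {
  Crc crc = v ^ kU;
  for (size_t i = 0; i < n; ++i) {
    // Add 8 message bits to the top coefficients.
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit)
      crc = MulByX(crc);
  }
  return crc ^ kU;
}
\end{lstlisting}

With these assumed parameters and $v(x)=0$ the sketch corresponds to the widely used CRC-32, whose check value for the nine ASCII bytes \texttt{123456789} is \texttt{0xCBF43926}.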
|
|
|
|
% -----------------------------------------
|
|
\subsection{Multiplication of $D$-normalized polynomials} \label{s:shiftandadd}
|
|
|
|
Multiplication of two $D$-normalized polynomials may be accomplished by
|
|
traditional bit-by-bit, shift-and-add multiplication. This is adequate if
|
|
performance is not a concern. Sample code is given in listing
|
|
\ref{l:MulNormalizedPoly}.
|
|
|
|
\begin{figure}
|
|
\begin{lstlisting}[caption={Multiplication of normalized polynomials},label={l:MulNormalizedPoly}]
// "a" and "b" occupy D least significant bits.
Crc Multiply(Crc a, Crc b) {
  Crc product = 0;
  Crc bPowX[D + 1];  // bPowX[k] = (b * x**k) mod P
  bPowX[0] = b;
  for (int k = 0; k < D; ++k) {
    // If "a" has non-zero coefficient at x**k
    // (stored in bit D-1-k of the register),
    // add ((b * x**k) mod P) to the result.
    if (((a >> (D - 1 - k)) & 1) != 0) product ^= bPowX[k];

    // Compute bPowX[k+1] = (b * x**(k+1)) mod P.
    if (bPowX[k] & 1) {
      // If degree of (bPowX[k] * x) is D, then
      // degree of (bPowX[k] * x - P) is less than D.
      bPowX[k+1] = (bPowX[k] >> 1) ^ P;
    } else {
      bPowX[k+1] = bPowX[k] >> 1;
    }
  }
  return product;
}
\end{lstlisting}
|
|
\end{figure}
|
|
|
|
% -----------------------------------------
|
|
\subsection{Multiplication of unnormalized polynomial}
|
|
|
|
During initialization of CRC tables it may be necessary to multiply
|
|
a $d$-normalized polynomial $v(x)$ of degree $d \neq D$ by a $D$-normalized
polynomial. It may be accomplished by representing the operand as a sum of
weighted polynomials of degree of no more than $(D-1)$, then calling
the $Multiply()$ function repeatedly as shown in listing
|
|
\ref{l:MulUnnormalizedPoly}.
|
|
|
|
\begin{figure}
|
|
\begin{lstlisting}[caption={Multiplication of unnormalized polynomial},label={l:MulUnnormalizedPoly}]
|
|
// "v" occupies "d" least significant bits.
|
|
// "m" occupies D least significant bits.
|
|
Crc MultiplyUnnormalized(Crc v, int d, Crc m) {
|
|
Crc result = 0;
|
|
while (d > D) {
|
|
Crc temp = v & ((1 << D) - 1);
|
|
v >>= D;
|
|
d -= D;
|
|
// XpowN returns (x**N mod P(x)).
|
|
result ^= Multiply(temp, Multiply(m, XpowN(d)));
|
|
}
|
|
result ^= Multiply(v << (D - d), m);
|
|
return result;
|
|
}
|
|
\end{lstlisting}
|
|
\end{figure}
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Computing powers of $x$} \label{s:mulpown}
|
|
|
|
Often (see sections \ref{s:changinginitialvalue}, \ref{s:concatenation},
|
|
\ref{s:storingcrcafter}) it is necessary to compute $x^N \bmod P(x)$ for
|
|
very large values of $N$. This may be accomplished in
|
|
$O\bigl(\log(N)\bigr)$ time.
|
|
|
|
Consider the binary representation of $N$:
|
|
\begin{align*}
|
|
N = \sum_{k=0}^K n_k 2^k
|
|
\end{align*}
|
|
where $n_k \in \{0, 1\}$. Then
|
|
\begin{align}
|
|
x^N &= x^{\sum n_k 2^k}
|
|
= \prod_{k=0}^K x^{n_k 2^k}
|
|
= \prod_{n_k \neq 0} x^{2^k} \label{e:pow2k}
|
|
\end{align}
|
|
and may be computed using no more than
|
|
$\left(\left\lfloor \log_2(N) \right\rfloor + 1\right)$
|
|
multiplications of polynomials of degree less than $D$ provided known
|
|
values of
|
|
\begin{align}
|
|
Pow2k(k) = x^{2^k} \bmod P(x).
|
|
\end{align}
|
|
|
|
Values of $Pow2k(k)$ may be computed iteratively using one multiplication
|
|
$\bmod P(x)$ per iteration:
|
|
\begin{align*}
|
|
Pow2k(0) &= x, \\
Pow2k(k + 1)
&= x^{2^{k+1}} \bmod P(x) = \\
&= x^{2 \cdot 2^k} \bmod P(x) = \\
&= \left(x^{2^k}\right)^2 \bmod P(x) = \\
&= \Bigl(Pow2k(k)\Bigr)^2 \bmod P(x).
|
|
\end{align*}
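
The square-and-multiply scheme of formula (\ref{e:pow2k}) can be sketched as follows. The sketch assumes $D=32$ and the common CRC-32 generating polynomial (constant \texttt{0xEDB88320}), neither of which is fixed by the text, and computes the values of $Pow2k(k)$ on the fly instead of storing them in a table.

\begin{lstlisting}[caption={Sketch: computing a power of x in logarithmic time}]
#include <cstdint>

typedef uint32_t Crc;
const Crc kP = 0xEDB88320u;  // assumed P(x), D = 32

// (a * x) mod P for a D-normalized "a".
Crc MulByX(Crc a) {
  return (a & 1) ? (a >> 1) ^ kP : (a >> 1);
}

// (a * b) mod P; bit (31 - k) of a register holds
// the coefficient of x**k.
Crc Multiply(Crc a, Crc b) {
  Crc product = 0;
  for (int k = 0; k < 32; ++k) {
    // Here "b" equals (original b * x**k) mod P.
    if ((a >> (31 - k)) & 1) product ^= b;
    b = MulByX(b);
  }
  return product;
}

// (x**n) mod P using the binary representation of n.
Crc XpowN(uint64_t n) {
  Crc result = 0x80000000u;  // the polynomial 1
  Crc pow2k = 0x40000000u;   // Pow2k(0) = x
  for (; n != 0; n >>= 1) {
    if (n & 1) result = Multiply(result, pow2k);
    // Pow2k(k+1) = (Pow2k(k) ** 2) mod P.
    pow2k = Multiply(pow2k, pow2k);
  }
  return result;
}
\end{lstlisting}

The loop performs at most two multiplications per bit of $n$. Precomputing the values of $Pow2k(k)$ once, as the text suggests, removes the squaring from each call.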
|
|
|
|
% -----------------------------------------
|
|
\subsection{Simplified CRC}
|
|
|
|
It is sufficient to be able to compute
|
|
\begin{align}
|
|
\SCRC\bigl(M(x), v(x)\bigr)
|
|
&= \Bigl(v(x) \cdot x^{|M|} + M(x) \cdot x^D\Bigr)
|
|
\bmod P(x), \label{e:simplifiedcrc}
|
|
\end{align}
|
|
since
|
|
\begin{align*}
|
|
\CRC\bigl(M(x),v(x)\bigr) &= \SCRC\bigl(M(x), v(x) - u(x)\bigr) + u(x).
\end{align*}
Consequently, $\CRC\bigl(M(x),v(x)\bigr)$ of message $M = M_1 \ldots M_K$ may be computed
|
|
incrementally using $\SCRC$ instead of $\CRC$:
|
|
\begin{align*}
|
|
v_0(x) &= v(x) - u(x), \\
|
|
v_k(x) &= \SCRC\bigl(M_k(x),v_{k-1}(x)\bigr), \\
|
|
\CRC(M(x), v(x)) &= v_K(x) + u(x).
|
|
\end{align*}
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Computing a CRC byte by byte} \label{s:crcbyte}
|
|
|
|
If $M(x)$ is a $W$-bit value (typically, $W=8$) and
|
|
$\deg\bigl(v(x)\bigr) < D$, by definition (\ref{e:simplifiedcrc})
|
|
\begin{align*}
|
|
\SCRC\bigl(M(x), v(x)\bigr)
|
|
= \Bigl(v(x) \cdot x^W + M(x) \cdot x^D\Bigr) \bmod P(x).
|
|
\end{align*}
|
|
|
|
When $D \leq W$,
|
|
\begin{align}
|
|
\SCRC\bigl(M(x), v(x)\bigr)
|
|
&= \Bigl(v(x) \cdot x^W + M(x) \cdot x^D\Bigr) \bmod P(x) = \nonumber \\
|
|
&= \Bigl(\bigl(v(x) \cdot x^{W-D} + M(x)\bigr) \cdot x^D\Bigr) \bmod P(x), \label{e:crcbytetable2}
|
|
\end{align}
|
|
which may be obtained via a single lookup into a precomputed table $T$ of size
$2^W$ such that $T[i] = \bigl(i(x) \cdot x^D\bigr) \bmod P(x)$ since
|
|
$\deg\bigl(v(x) \cdot x^{W-D} + M(x)\bigr) < W$.
|
|
|
|
$D$-normalized representation of $v(x)$ occupies $D$ least significant bits
|
|
and is equal to $\left(v(x) \cdot x^{W-D}\right)$ when viewed as
|
|
$W$-normalized representation which is required to form $W$-bit index into
|
|
a table of $2^W$ entries. Therefore, explicit multiplication of $v(x)$ by
|
|
$x^{W-D}$ in formula (\ref{e:crcbytetable2}) is not required.
|
|
|
|
When $D \geq W$, $v(x)$ may be represented as
|
|
\begin{align*}
|
|
v(x) = v_L(x) + v_H(x) \cdot x^{D-W}
|
|
\end{align*}
|
|
where
|
|
\begin{align*}
|
|
v_H(x) &= \left\lfloor\frac{v(x)}{x^{D-W}}\right\rfloor,
|
|
&\deg\bigl(v_H(x)\bigr) &< W, \\
|
|
v_L(x) &= v(x) \bmod x^{D-W},
|
|
&\deg\bigl(v_L(x)\bigr) &< D-W.
|
|
\end{align*}
|
|
|
|
Since $\deg\bigl(v_L(x) \cdot x^W \bigr) < D$, $\Bigl(v_L(x) \cdot
|
|
x^W\Bigr) \bmod P(x) = v_L(x) \cdot x^W$. Therefore,
|
|
\begin{align}
|
|
& \SCRC\bigl(M(x), v(x)\bigr) = \nonumber \\
|
|
&= \Bigl(v(x) \cdot x^W + M(x) \cdot x^D\Bigr) \bmod P(x) = \nonumber \\
|
|
&= \Bigl(\bigl(v_L(x) + v_H(x) \cdot x^{D-W}\bigr)\cdot x^W + M(x) \cdot x^D\Bigr) \bmod P(x) = \nonumber \\
|
|
&= \Bigl(v_L(x) \cdot x^W + \bigl(v_H(x) + M(x)\bigr) \cdot x^D\Bigr) \bmod P(x) = \nonumber \\
|
|
&= \Bigl(v_L(x) \cdot x^W\Bigr) \bmod P(x) + \Bigl(\bigl(v_H(x) + M(x)\bigr) \cdot x^D \Bigr) \bmod P(x) = \nonumber \\
|
|
&= \Bigl(v_L(x) \cdot x^W\Bigr) + \mbox{MulByXpowD}\bigl(v_H(x) + M(x)\bigr), \label{e:crcbyte}
|
|
\end{align}
|
|
where
|
|
\begin{align}
|
|
\mbox{MulByXpowD}\bigl(a(x)\bigr) = \bigl(a(x) \cdot x^D \bigr) \bmod P(x). \label{e:crcbytetable}
|
|
\end{align}
|
|
|
|
The value of $\bigl(v_L(x) \cdot x^W\bigr)$ may be computed by shifting
|
|
$v(x)$ by $W$ bits and discarding $W$ carry-over zero bits.
|
|
|
|
Since $\deg\bigl(v_H(x) + M(x)\bigr) < W$, the value of
|
|
$\mbox{MulByXpowD}\bigl(v_H(x) + M(x)\bigr)$ may be obtained using
|
|
a precomputed table containing $2^W$ entries.
|
|
|
|
The classic table-driven, byte-by-byte CRC computation \cite{Perez83,
|
|
DBLP:journals/cacm/Sarwate88} implementing formulas
|
|
(\ref{e:crcdefinition}), (\ref{e:incremental}), (\ref{e:crcbytetable2}),
|
|
(\ref{e:crcbyte}), and (\ref{e:crcbytetable}) for $W=8$ is given in listing
|
|
\ref{l:CrcByte}.
|
|
|
|
\begin{figure}
|
|
\begin{lstlisting}[caption={Computing CRC byte by byte},label={l:CrcByte}]
|
|
Crc CrcByte(Byte value) {
|
|
return MulByXpowD[value];
|
|
}
|
|
Crc CrcByteByByte(Byte *data, int n, Crc v, Crc u) {
|
|
Crc crc = v ^ u;
|
|
for (int i = 0; i < n; ++i) {
|
|
Crc ByteCrc = CrcByte(crc ^ data[i]);
|
|
crc >>= 8;
|
|
crc ^= ByteCrc;
|
|
}
|
|
return (crc ^ u);
|
|
}
|
|
void InitByteTable() {
|
|
for (int i = 0; i < 256; ++i) {
|
|
MulByXpowD[i] = MultiplyUnnormalized(i, 8, XpowN(D));
|
|
}
|
|
}
|
|
\end{lstlisting}
|
|
\end{figure}
|
|
|
|
Experience shows that computing a CRC byte by byte is rather slow and,
depending on the compiler and input data size, takes 6--8 CPU cycles per
byte on a modern 64-bit CPU for $D \leq 64$. There are two reasons for this:
|
|
|
|
\begin{enumerate}
|
|
\item
|
|
Reading data 8 bits at a time is not the most efficient data access
|
|
method on a 64-bit CPU.
|
|
\item
|
|
Modern CPUs have multiple ALUs and may execute 3--4 instructions per CPU
cycle provided the instructions handle independent data flows. However,
byte-by-byte CRC contains only one data flow. Furthermore, most
|
|
instructions use the result from the previous instruction, leading to CPU
|
|
stalls because of result propagation delays.
|
|
\end{enumerate}
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Rolling CRC} \label{s:rollingcrc}
|
|
|
|
Given a set of messages $M_k=m_{k} \ldots m_{k+N-1}$ where $m_k$ are
|
|
$W$-bit symbols and $N$ is fixed (i.e. each next message is obtained by
removing the first symbol and appending a new one), $C_{k+1} = \CRC(M_{k+1}, v)$
may be obtained from known $C_k = \CRC(M_k, v)$ and symbols $m_k$ and
$m_{k+N}$ only, without the need to compute the CRC of the entire message
|
|
$M_{k+1}$. This property may be utilized to efficiently compute a set of
|
|
rolling Rabin fingerprints.
|
|
|
|
Since $M_{k+1}(x) = M_k(x) x^W - m_{k}(x) x^{NW} + m_{k+N}(x)$,
|
|
\begin{align*}
|
|
& C_{k+1}(x) = \CRC\bigl(M_{k+1}(x), v(x)\bigr) = \\
|
|
&= \left(\bigl(v(x)-u(x)\bigr) x^{NW} + u(x) + \sum_{n=0}^{N-1} m_{k+1+n}(x) x^{D+W(N-1-n)} \right) \bmod P(x) = \\
|
|
&= F\bigl(C_k(x), m_{k+N}(x)\bigr) + G\bigl(m_k(x)\bigr),
|
|
\end{align*}
|
|
where
|
|
\begin{align*}
|
|
& F\bigl(C_k(x), m_{k+N}(x)\bigr) = \Bigl(C_k(x) x^W + m_{k+N}(x) x^D\Bigr) \bmod P, \\
|
|
& G\bigl(m_k(x)\bigr) = \Bigl(\bigl(\bigl(v(x)-u(x)\bigr) x^{NW} + u\bigr) (1 - x^W) - m_k(x) x^{D+NW} \Bigr) \bmod P
|
|
\end{align*}
|
|
are polynomials of degree less than $D$.
|
|
|
|
$G\bigl(m_k(x)\bigr)$ may be computed easily via a single lookup
|
|
in a table of $2^W$ entries indexed by $m_k$.
|
|
|
|
Computation of $F\bigl(C_k(x), m_{k+N}(x)\bigr)$ may be implemented as
|
|
described in section \ref{s:crcbyte} and requires one bitwise shift, one
|
|
bitwise XOR, and one lookup into a precomputed table containing $2^W$
|
|
entries.
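
The two table lookups can be put together as in the following sketch. It is an illustration under assumed parameters, not the authors' implementation: $W=8$, $D=32$, the common CRC-32 generating polynomial (constant \texttt{0xEDB88320}), and $u(x)$ with all coefficients set. The structure \texttt{RollingCrc} and its members are hypothetical names, and the tables are filled with a slow bit-at-a-time multiplication for clarity.

\begin{lstlisting}[caption={Sketch: rolling CRC over a window of N bytes}]
#include <cstddef>
#include <cstdint>

typedef uint32_t Crc;
const Crc kP = 0xEDB88320u;  // assumed P(x), D = 32
const Crc kU = 0xFFFFFFFFu;  // assumed u(x)

// (a * x**bits) mod P, one bit at a time.
Crc MulByXpowN(Crc a, size_t bits) {
  while (bits-- > 0)
    a = (a & 1) ? (a >> 1) ^ kP : (a >> 1);
  return a;
}

// Reference bit-by-bit CRC(M, v), for checking only.
Crc CrcBitwise(const unsigned char *data, size_t n,
               Crc v) {
  Crc crc = v ^ kU;
  for (size_t i = 0; i < n; ++i) {
    crc ^= data[i];
    crc = MulByXpowN(crc, 8);
  }
  return crc ^ kU;
}

struct RollingCrc {
  Crc f_table[256];  // (m * x**D) mod P
  Crc g_table[256];  // G(m)

  // "window" is N, the window length in bytes.
  RollingCrc(size_t window, Crc v) {
    // k = ((v - u) * x**(N*W) + u) mod P.
    Crc k = MulByXpowN(v ^ kU, 8 * window) ^ kU;
    // g0 = (k * (1 - x**W)) mod P.
    Crc g0 = k ^ MulByXpowN(k, 8);
    for (int m = 0; m < 256; ++m) {
      f_table[m] = MulByXpowN((Crc)m, 8);
      // Subtract (m * x**(D + N*W)) mod P.
      g_table[m] =
          g0 ^ MulByXpowN(f_table[m], 8 * window);
    }
  }

  // C_{k+1} from C_k, the outgoing symbol m_k and
  // the incoming symbol m_{k+N}.
  Crc Roll(Crc crc, unsigned char out,
           unsigned char in) const {
    Crc f = (crc >> 8) ^ f_table[(crc ^ in) & 0xFF];
    return f ^ g_table[out];
  }
};
\end{lstlisting}

Each step costs one shift, two table lookups and two XORs, independent of the window length $N$.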
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Reading multiple bytes at a time} \label{s:crcbyteword}
|
|
|
|
One straightforward way to speed up byte-by-byte CRC computation is to read
|
|
$W > 8$ bits at once. Unfortunately, this is a path of very rapidly
diminishing returns as the size of the MulByXpowD table increases with $W$
exponentially. From a practical perspective, it is extremely desirable to
ensure that the MulByXpowD table fits into the L1 cache (32-64KB), otherwise
table entry access latency sharply increases from 3-4 CPU cycles (L1 cache)
to 15-20 CPU cycles (L2 cache).
|
|
|
|
The value of $\mbox{MulByXpowD}\bigl(v(x)\bigr)$ may be computed
|
|
iteratively using a smaller table because
|
|
\begin{align}
|
|
\mbox{MulByXpowD}\bigl(v(x)\bigr)
|
|
= v(x) \cdot x^D \bmod P(x)
|
|
= \SCRC\bigl(v(x), 0\bigr) \label{e:readwordatonce}
|
|
\end{align}
|
|
and therefore may be computed using formulas (\ref{e:incremental}) and
|
|
(\ref{e:crcbyte}) for smaller values of $W'$.
|
|
|
|
Black \cite{Black93} provided the implementation for $W=32$ and $W'=8$. Our more
general implementation was faster than byte-by-byte CRC but not
substantially: the improvement was in the 20--25\% range. However, the result is
|
|
still important -- it demonstrates that reading input data per se is not a
|
|
bottleneck.
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Computing a CRC word by word} \label{s:crcword}
|
|
|
|
The value of $\mbox{MulByXpowD}\bigl(v(x)\bigr)$ may be computed using
|
|
multiple smaller tables instead of one table. Given that
|
|
$\deg\bigl(v(x)\bigr) < W$, $v(x)$ may be represented as a weighted
|
|
sum of polynomials $v_k(x)$ such that $\deg\bigl(v_k(x)\bigr) < B$:
|
|
\begin{align*}
|
|
v(x) = \sum_{k=0}^{K-1} v_k(x) \cdot x^{(K-1-k)B},
|
|
\end{align*}
|
|
where $K = \lceil W/B \rceil$ and
|
|
\begin{align*}
|
|
v_k(x) = \left\lfloor \frac{v(x)}{x^{(K-1-k)B}} \right\rfloor \bmod x^B.
|
|
\end{align*}
|
|
|
|
Consequently,
|
|
\begin{align}
|
|
\mbox{MulByXpowD}\bigl(v(x)\bigr)
|
|
&= v(x) \cdot x^D \bmod P(x) = \nonumber \\
|
|
&= \left(\sum_{k=0}^{K-1} v_k(x) \cdot x^{(K-1-k)B}\right) \cdot x^D \bmod P(x) = \nonumber \\
|
|
&= \sum_{k=0}^{K-1} \left(v_k(x) \cdot x^{(K-1-k)B+D} \bmod P(x)\right) = \nonumber \\
|
|
&= \sum_{k=0}^{K-1} \mbox{MulWordByXpowD}\bigl(k, v_k(x)\bigr), \label{e:crcword}
|
|
\end{align}
|
|
where the values of
|
|
\begin{align}
|
|
\mbox{MulWordByXpowD}\bigl(k, v_k(x)\bigr) = v_k(x) \cdot x^{(K-1-k)B+D} \bmod P(x) \label{e:crcwordtable}
|
|
\end{align}
|
|
may be obtained using $K$ precomputed tables. Given that
|
|
$\deg\bigl(v_k(x)\bigr) < B$, each table should contain $2^B$
|
|
entries.
|
|
|
|
A sample implementation of formulas (\ref{e:crcdefinition}),
|
|
(\ref{e:incremental}), (\ref{e:crcword}), and (\ref{e:crcwordtable}) is
|
|
given in listing \ref{l:CrcWord} using $B=8$ and assuming that $W$ is a
|
|
multiple of 8.
|
|
|
|
\begin{figure}
|
|
\begin{lstlisting}[caption={Computing CRC word by word},label={l:CrcWord}]
Crc CrcWord(Word value) {
  Crc result = 0;
  // Unroll this loop or let compiler do it.
  // sizeof(Word) is the number of bytes in a word.
  for (int byte = 0; byte < sizeof(Word); ++byte) {
    result ^= MulWordByXpowD[byte][(Byte) value];
    value >>= 8;
  }
  return result;
}
Crc CrcWordByWord(Word *data, int n, Crc v, Crc u) {
  Crc crc = v ^ u;
  for (int i = 0; i < n; ++i) {
    Crc WordCrc = CrcWord(crc ^ data[i]);
    if (sizeof(Crc) <= sizeof(Word)) {
      crc = WordCrc;
    } else {
      // Shift by the word width W, not by 8 bits.
      crc >>= sizeof(Word) * 8;
      crc ^= WordCrc;
    }
  }
  return (crc ^ u);
}
void InitWordTables() {
  for (int byte = 0; byte < sizeof(Word); ++byte) {
    // (K-1-k)*B + D = (W/8-1-byte)*8 + D = D - 8 + W - 8*byte.
    Crc m = XpowN(D - 8 + sizeof(Word)*8 - 8*byte);
    for (int i = 0; i < 256; ++i) {
      MulWordByXpowD[byte][i] = MultiplyUnnormalized(i, 8, m);
    }
  }
}
\end{lstlisting}
|
|
\end{figure}
|
|
|
|
CrcWordByWord\footnote{The variant presented in this paper is more general
|
|
than ``slicing'' described in \cite{Kounavis2005\remove{,
|
|
DBLP:conf/iscc/KounavisB05, DBLP:journals/tc/KounavisB08}}. Sample
|
|
implementation given in listing \ref{l:CrcWord} does not include one subtle
|
|
optimization implemented in \cite{Kounavis2005\remove{,
|
|
DBLP:conf/iscc/KounavisB05, DBLP:journals/tc/KounavisB08}} as it was found
|
|
to be counter-productive.} with $W=64$ uses only 2.1-2.2 CPU cycles/byte on
|
|
modern 64-bit CPUs (our implementation is somewhat faster than the one
|
|
described in \cite{Kounavis2005\remove{, DBLP:conf/iscc/KounavisB05,
|
|
DBLP:journals/tc/KounavisB08}}). It solves the problem with data access
|
|
and, to a lesser degree, allows instruction level parallelism: in the middle
of the unrolled main loop of the CrcWord function the CPU may process
|
|
multiple bytes in parallel.
|
|
|
|
However, this solution is still imperfect -- the beginning of computation
|
|
contends for a single source of data (variable $value$), and the end of
|
|
computation contends for a single destination (variable $result$). Further
|
|
improvement requires processing of multiple independent data streams in
|
|
an interleaved manner so that when computation of one data flow path is
|
|
stalled the CPU may proceed with another one.
|
|
|
|
% -----------------------------------------
|
|
\subsection{Processing non-overlapping blocks in parallel} \label{s:blockword}
|
|
|
|
Straightforward pipelining may be achieved by splitting the input message
|
|
$M(x)=M_0(x) \ldots M_{N-1}(x)$ into $N$ blocks $M_k(x)$ of approximately
|
|
the same size and computing CRC of each block in an interleaved manner,
|
|
concatenating CRCs of individual blocks in the end. A sample implementation
|
|
is given in listing \ref{l:CrcWordBlock}.
|
|
|
|
\begin{figure}
|
|
\begin{lstlisting}[caption={Processing non-overlapping blocks in parallel},label={l:CrcWordBlock}]
|
|
// Processes N stripes of StripeWidth words each
|
|
// word by word, in an interleaved manner.
|
|
Crc CrcWordByWordBlocks(Word *data, Crc v, Crc u) {
|
|
assert(n % (N * StripeWidth) == 0);
|
|
// Use N local variables instead of the array.
|
|
Crc crc[N];
|
|
// Initialize the CRC value for each stripe.
|
|
crc[0] = v ^ u;
|
|
for (int stripe = 1; stripe < N; ++stripe)
|
|
crc[stripe] = 0 ^ u;
|
|
// Compute each stripe's CRC.
|
|
for (int i = 0; i < StripeWidth; ++i) {
|
|
// Compute multiple CRCs in interleaved manner.
|
|
Word buf[N];
|
|
for (int stripe = 0; stripe < N; ++stripe) {
|
|
buf[stripe] =
|
|
crc[stripe] ^ data[i + stripe * StripeWidth];
|
|
if (D > sizeof(Word) * 8) {
|
|
crc[stripe] >>= D - sizeof(Word) * 8;
|
|
} else {
|
|
crc[stripe] = 0;
|
|
}
|
|
}
|
|
for (int byte = 0; byte < sizeof(Word); ++byte) {
|
|
for (int stripe = 0; stripe < N; ++stripe) {
|
|
crc[stripe] ^=
|
|
MulWordByXpowD[byte][(Byte) buf[stripe]];
|
|
buf[stripe] >>= 8;
|
|
}
|
|
}
|
|
}
|
|
// Combine stripe CRCs.
|
|
for (int stripe = 1; stripe < N; ++stripe) {
|
|
crc[0] = ChangeStartingValue(
|
|
crc[stripe], StripeWidth, 0, crc[0]);
|
|
}
|
|
return (crc[0] ^ u);
|
|
}
|
|
\end{lstlisting}
|
|
\end{figure}
|
|
|
|
A tuned implementation of $CrcWordByWordBlocks$ is capable of processing
data at 1.3-1.4 CPU cycles/byte on sufficiently large (64KB and more)
inputs, which is noticeably better than the 2.1-2.2 CPU cycles/byte delivered
by word-by-word CRC computation. It is a good sign that this is a move in
the right direction.
|
|
|
|
The drawbacks of this approach are obvious: it does not work well with
small inputs, where the cost of CRC concatenation becomes a bottleneck, and
it may be susceptible to false cache collisions caused by cache line
aliasing.
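
The concatenation step whose cost is discussed above can be illustrated with
a minimal sketch. It is not the ChangeStartingValue routine used in listing
\ref{l:CrcWordBlock}: it assumes the common bit-reflected CRC-32C polynomial,
and it multiplies by $x^{8S}$ simply by feeding $S$ zero bytes, which takes
time linear in $S$ and is therefore suitable for illustration only. The names
UpdateCrc, AppendZeroBytes and ConcatCrc are hypothetical.

\begin{lstlisting}[caption={Sketch: concatenating two CRCs by feeding zero bytes (illustrative assumptions)}]
#include <cstddef>
#include <cstdint>

// Assumption: CRC-32C (Castagnoli) polynomial in bit-reflected form.
constexpr uint32_t kCrc32cPoly = 0x82F63B78u;

// Bit-by-bit update of a raw CRC state (no pre- or post-inversion).
uint32_t UpdateCrc(uint32_t crc, const uint8_t* data, std::size_t size) {
  for (std::size_t i = 0; i < size; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit)
      crc = (crc >> 1) ^ ((crc & 1) ? kCrc32cPoly : 0);
  }
  return crc;
}

// Multiplies the state by x^(8*count) mod P(x) by feeding zero bytes.
uint32_t AppendZeroBytes(uint32_t crc, std::size_t count) {
  const uint8_t zero = 0;
  for (std::size_t i = 0; i < count; ++i) crc = UpdateCrc(crc, &zero, 1);
  return crc;
}

// State for the message A||B, given the state of A (any start value),
// the state of B (start value 0) and the length of B in bytes.
uint32_t ConcatCrc(uint32_t crc_a, uint32_t crc_b, std::size_t size_b) {
  return AppendZeroBytes(crc_a, size_b) ^ crc_b;
}
\end{lstlisting}

The sketch relies only on the linearity of the CRC state update. A practical
implementation would replace AppendZeroBytes with a multiplication by a
precomputed power of $x$ modulo $P(x)$, which is exactly the cost that
becomes noticeable on small inputs.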
|
|
|
|
If the cost of CRC concatenation were not a problem, cache pressure could be
mitigated with the use of very narrow stripes. The code in question, lines
33-37 of listing \ref{l:CrcWordBlock}, which combine the CRCs of individual
stripes, iteratively computes
|
|
\begin{align*}
|
|
\mbox{crc}_0(x) = \mbox{crc}_k(x) + \Bigl(\mbox{crc}_0(x) \cdot x^{8S} \bmod P(x)\Bigr)
|
|
\end{align*}
|
|
for $k = 1, \ldots, N-1$ where $N$ and $S$ are the number and the width of
|
|
the stripes respectively. It may be rearranged as
|
|
\begin{align*}
|
|
\mbox{crc}_0(x) = \sum_{k = 0}^{N-1} \Bigl(\mbox{crc}_{N-1-k}(x) \cdot x^{8kS} \bmod P(x)\Bigr).
|
|
\end{align*}
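
As an illustration (a worked case added under the assumption $N = 3$; the
superscripts $(1)$ and $(2)$ are introduced here only to number the
iterations), the two combining steps unroll as
\begin{align*}
\mbox{crc}_0^{(1)}(x) &= \Bigl(\mbox{crc}_1(x) + \mbox{crc}_0(x) \cdot x^{8S}\Bigr) \bmod P(x), \\
\mbox{crc}_0^{(2)}(x) &= \Bigl(\mbox{crc}_2(x) + \mbox{crc}_0^{(1)}(x) \cdot x^{8S}\Bigr) \bmod P(x) = \\
&= \Bigl(\mbox{crc}_2(x) + \mbox{crc}_1(x) \cdot x^{8S} + \mbox{crc}_0(x) \cdot x^{16S}\Bigr) \bmod P(x),
\end{align*}
so the CRC of stripe $j$ ends up multiplied by $x^{8(N-1-j)S}$.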
|
|
|
|
Explicit multiplication by $x^{8kS}$ may be avoided by moving it into preset
|
|
tables
|
|
\begin{align*}
\mbox{MulWordByXPowD}_k(n) = \mbox{MulWordByXPowD}(n) \cdot x^{8kS} \bmod P(x)
\end{align*}
that are used to compute $\mbox{crc}'_k(x) = \mbox{crc}_k(x) \cdot x^{8kS}$, so that
\begin{align*}
\mbox{crc}_0(x) = \sum_{k = 0}^{N-1} \mbox{crc}'_k(x).
\end{align*}
|
|
|
|
Unfortunately, this approach alone does not help because
|
|
\begin{enumerate}
|
|
\item
|
|
It increases the memory footprint of MulWordByXPowD by a factor of $N$.
Once the cumulative size of the $\mbox{MulWordByXPowD}_k$ tables exceeds the
size of the L1 cache (32-64KB), the cost of memory access to multiplication
table data increases from 3-4 CPU cycles to 15-20, eliminating all
performance gains achieved by reducing the number of table operations.
|
|
\item
|
|
It is still necessary to combine all $N$ values of $\mbox{crc}_k$ into
|
|
$\mbox{crc}_0$ at the end of the CRC computation.
|
|
\end{enumerate}
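
The first point can be made concrete with a little arithmetic. The sketch
below assumes that each table has the shape used by MulWordByXpowD, that is,
256 entries of type Crc for each byte of a Word, and that both Word and Crc
are 64 bits wide; the helper name TableFootprint is hypothetical.

\begin{lstlisting}[caption={Sketch: memory footprint of per-stripe multiplication tables (illustrative assumptions)}]
#include <cstddef>

// Bytes used by `tables` copies of a Crc[word_bytes][256] table.
constexpr std::size_t TableFootprint(std::size_t word_bytes,
                                     std::size_t crc_bytes,
                                     std::size_t tables) {
  return tables * word_bytes * 256 * crc_bytes;
}

// 64-bit Word, 64-bit Crc: one table set takes 16KB.
static_assert(TableFootprint(8, 8, 1) == 16 * 1024, "one table set");
// Three stripes already exceed a 32KB L1 data cache.
static_assert(TableFootprint(8, 8, 3) == 48 * 1024, "three table sets");
// Four stripes use all of a 64KB L1 data cache.
static_assert(TableFootprint(8, 8, 4) == 64 * 1024, "four table sets");
\end{lstlisting}

Under these assumptions even a small number of stripes leaves no room in the
L1 cache for the input data itself.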
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Interleaved word-by-word CRC} \label{s:multiword}
|
|
|
|
% -----------------------------------------
|
|
\subsubsection{Parallelizing CRC computation} \label{s:parallelizing}
|
|
|
|
Assume that the input message $M$ is the concatenation of $K$ groups $g_k$, and
each group $g_k$ is the concatenation of $N$ $W$-bit long words:
|
|
\begin{align*}
|
|
M(x) &= \sum_{k=0}^{K-1} g_k(x) \cdot x^{(K-1-k)NW}, \\
|
|
g_k(x) &= \sum_{n=0}^{N-1} m_{k, n} \cdot x^{(N-1-n)W}.
|
|
\end{align*}
|
|
|
|
Input message $M(x)$ may be represented as
|
|
\begin{align}
|
|
M(x)
|
|
&= \sum_{k=0}^{K-1} g_k(x) \cdot x^{(K-1-k)NW} = \nonumber \\
|
|
&= \sum_{k=0}^{K-1} \left(\sum_{n=0}^{N-1} m_{k, n} \cdot x^{(N-1-n)W} \right) \cdot x^{(K-1-k)NW} = \nonumber \\
|
|
&= \sum_{n=0}^{N-1} \left(\sum_{k=0}^{K-1} m_{k, n} \cdot x^{(K-1-k)NW}\right) \cdot x^{(N-1-n)W} = \nonumber \\
|
|
&= \sum_{n=0}^{N-1} M_n(x) \cdot x^{(N-1-n)W} \label{e:splitbyword}
|
|
\end{align}
|
|
where
|
|
\begin{align*}
|
|
M_n(x)
|
|
&= \sum_{k=0}^{K-1} m_{k, n} \cdot x^{(K-1-k)NW}.
|
|
\end{align*}
|
|
|
|
In other words, $M_n$ is the concatenation of the $n$-th $W$-bit word from $g_0$
followed by $(N-1)W$ zero bits, then the $n$-th word from $g_1$ followed by
$(N-1)W$ zero bits, etc., ending with the $n$-th word from the last group
$g_{K-1}$.
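
As a small worked case (added for illustration, assuming $K = 2$ groups of
$N = 2$ words each), the message is
\begin{align*}
M(x) &= m_{0, 0} \cdot x^{3W} + m_{0, 1} \cdot x^{2W} + m_{1, 0} \cdot x^{W} + m_{1, 1},
\end{align*}
and the two streams are
\begin{align*}
M_0(x) &= m_{0, 0} \cdot x^{2W} + m_{1, 0}, \\
M_1(x) &= m_{0, 1} \cdot x^{2W} + m_{1, 1},
\end{align*}
so that $M(x) = M_0(x) \cdot x^{W} + M_1(x)$, in agreement with
(\ref{e:splitbyword}).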
|
|
|
|
Appending $(N-1)W$ zero bits to $M_n$ yields $M'_n(x) = M_n(x) \cdot
x^{(N-1)W}$ which may be viewed as the concatenation of $K$ $NW$-bit
groups $f_{k, n}$:
\begin{align*}
M'_n(x)
&= M_n(x) \cdot x^{(N-1)W}
= \sum_{k=0}^{K-1} f_{k, n}(x) \cdot x^{(K-1-k)NW}, \\
f_{k, n}(x) &= m_{k, n}(x) \cdot x^{(N-1)W},
\end{align*}
|
|
so
|
|
\begin{align}
|
|
M(x)
|
|
&= \sum_{n=0}^{N-1} M_n(x) \cdot x^{(N-1-n)W} \nonumber \\
|
|
&= \sum_{n=0}^{N-1} M'_n(x) \cdot x^{-(N-1)W} \cdot x^{(N-1-n)W} \nonumber \\
|
|
&= \sum_{n=0}^{N-1} M'_n(x) \cdot x^{-nW}. \label{e:mdash}
|
|
\end{align}
|
|
|
|
According to (\ref{e:incremental}),
|
|
$v_{K, n}(x) = \SCRC\bigl(M'_n(x), v_{0, n}(x)\bigr)$
|
|
may be computed incrementally:
|
|
\begin{align}
|
|
v_{k+1, n}(x)
|
|
&= \SCRC\bigl(f_{k, n}(x), v_{k, n}(x)\bigr) = \nonumber \\
|
|
&= \SCRC\bigl(m_{k, n}(x) \cdot x^{(N-1)W}, v_{k, n}(x)\bigr) = \nonumber \\
|
|
&= \Bigl(v_{k, n}(x) \cdot x^{NW} + m_{k, n}(x) \cdot x^{(N-1)W} \cdot x^D \Bigr) \bmod P(x) = \nonumber \\
|
|
&= \Bigl(v_{k, n}(x) \cdot x^W + m_{k, n}(x) \cdot x^D \Bigr) \cdot x^{(N-1)W} \bmod P(x) = \label{e:crcwordnmultiply} \\
|
|
&= \mbox{CrcWordN}\bigl(m_{k, n}(x), v_{k, n}(x)\bigr). \label{e:crcwordn}
|
|
\end{align}
|
|
|
|
This approach:
|
|
\begin{enumerate}
|
|
\item
|
|
Creates $N$ independent data flows: computation of $v_{k, 0}, \ldots,
|
|
v_{k, N-1}$ may be performed truly in parallel. There are no contentions
|
|
on a single data source or destination like those the word-by-word CRC
|
|
computation described in section \ref{s:crcword} suffered from.
|
|
\item
|
|
Input data is accessed sequentially. Therefore, the load on the cache
subsystem and false cache collisions are minimal. Thus, the performance
bottlenecks of the approach described in section \ref{s:blockword} are eliminated.
|
|
\end{enumerate}
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsubsection{Combining individual CRCs} \label{s:combine}
|
|
|
|
Once $v_{K, n}(x) = \SCRC\bigl(M'_n(x), v_{0, n}(x)\bigr)$ are computed starting with
|
|
\begin{align*}
|
|
v_{0, 0} &= v(x), \\
|
|
v_{0, n} &= 0, n \geq 1,
|
|
\end{align*}
|
|
by definition (\ref{e:simplifiedcrc}) of $\SCRC$ and relationship (\ref{e:mdash}),
|
|
\begin{align}
|
|
\SCRC\bigl(M(x), v(x)\bigr)
|
|
&= \SCRC\left(\sum_{n=0}^{N-1} M'_n(x) \cdot x^{-nW}, v(x)\right) = \nonumber \\
|
|
&= \sum_{n=0}^{N-1} \SCRC\bigl(M'_n(x) \cdot x^{-nW}, v_{0, n}(x) \bigr) = \nonumber \\
|
|
&= \sum_{n=0}^{N-1} \SCRC\bigl(M'_n(x), v_{0, n}(x) \bigr) \cdot x^{-nW} = \nonumber \\
|
|
&= \sum_{n=0}^{N-1} v_{K, n}(x) \cdot x^{-nW}. \label{e:multiwordcrc1}
|
|
\end{align}
|
|
|
|
Even though this step is performed only once per input message, it still
requires $(N-1)$ non-trivial multiplications modulo $P(x)$, negatively
affecting the performance on small input messages. Also,
(\ref{e:multiwordcrc1}) uses the multiplicative inverse of $x^{nW}$ modulo
$P(x)$, which does not exist when $P(x) \bmod x = 0$.
|
|
|
|
There is a more efficient and elegant solution. Assume that $M(x)$ is
followed by one more group $g_K(x)$. Then
|
|
\begin{align}
|
|
& \SCRC\bigl(M(x) \cdot x^{NW} + g_K(x), v(x)\bigr) = \nonumber \\
|
|
& = \SCRC\Bigl(g_K(x), \SCRC\bigl(M(x), v(x)\bigr)\Bigr) = \nonumber \\
|
|
& = \Bigl(\SCRC\bigl(M(x), v(x)\bigr) \cdot x^{NW} + g_K(x) \cdot x^D \Bigr) \bmod P(x) = \nonumber \\
|
|
& = \left(x^{NW} \sum_{n=0}^{N-1} v_{K, n}(x) \cdot x^{-nW} + x^D \sum_{n=0}^{N-1} m_{K, n}(x) \cdot x^{(N-1-n)W} \right) \bmod P(x) = \nonumber \\
|
|
& = \left(x^{W} \sum_{n=0}^{N-1} v_{K, n}(x) \cdot x^{(N-1-n)W} + x^D \sum_{n=0}^{N-1} m_{K, n}(x) \cdot x^{(N-1-n)W} \right) \bmod P(x) = \nonumber \\
|
|
& = \sum_{n=0}^{N-1} \Bigl( v_{K, n}(x) \cdot x^{W} + m_{K, n}(x) \cdot x^D\Bigr) \cdot x^{(N-1-n)W} \bmod P(x) = \label{e:additionalmultiply} \\
|
|
& = \sum_{n=0}^{N-1} \SCRC\bigl(m_{K, n}(x), v_{K, n}(x)\bigr) \cdot x^{(N-1-n)W} \bmod P(x). \label{e:additionalmultiply2}
|
|
\end{align}
|
|
|
|
Formula (\ref{e:additionalmultiply2}) may be implemented using formula
(\ref{e:crcwordtable}) by setting $v'_0 = 0$ and then, for $n = 0, \ldots,
N-1$, computing
|
|
\begin{align*}
|
|
v'_{n+1}(x)
|
|
&= \Bigl(\bigl(v'_n(x) + v_{K, n}\bigr) \cdot x^W + m_{K, n} \cdot x^D\Bigr) \bmod P(x) \\
|
|
&= \SCRC\bigl(m_{K,n}, v'_n(x) + v_{K, n} \bigr).
|
|
\end{align*}
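
As a check (a worked case added for illustration, assuming $N = 2$), the
recurrence unrolls as
\begin{align*}
v'_1(x) &= \Bigl(v_{K, 0}(x) \cdot x^W + m_{K, 0}(x) \cdot x^D\Bigr) \bmod P(x), \\
v'_2(x) &= \Bigl(\bigl(v'_1(x) + v_{K, 1}(x)\bigr) \cdot x^W + m_{K, 1}(x) \cdot x^D\Bigr) \bmod P(x) = \\
&= \Bigl(\bigl(v_{K, 0}(x) \cdot x^W + m_{K, 0}(x) \cdot x^D\bigr) \cdot x^W
 + \bigl(v_{K, 1}(x) \cdot x^W + m_{K, 1}(x) \cdot x^D\bigr)\Bigr) \bmod P(x),
\end{align*}
which is the right-hand side of (\ref{e:additionalmultiply}) for $N = 2$:
the term with $n = 0$ carries the factor $x^{(N-1-n)W} = x^W$ and the term
with $n = 1$ carries the factor $1$.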
|
|
|
|
Alternatively, this step may be performed using the less efficient
|
|
technique described in section \ref{s:crcbyteword}.
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsubsection{Efficient computation of individual CRCs} \label{s:compute}
|
|
|
|
Given $v(x)$, $\deg\bigl(v(x)\bigr) < D$ and $m(x)$, $\deg\bigl(m(x)\bigr) < W$,
|
|
\begin{align*}
|
|
\mbox{CrcWordN}\bigl(m(x), v(x)\bigr)
|
|
&= \Bigl(v(x) \cdot x^W + m(x) \cdot x^D \Bigr) \cdot x^{(N-1)W} \bmod P(x)
|
|
\end{align*}
|
|
may be implemented efficiently utilizing the techniques described in
|
|
sections \ref{s:crcbyte}, \ref{s:crcbyteword}, and \ref{s:crcword}. When $D
|
|
\leq W$,
|
|
\begin{align*}
|
|
\mbox{CrcWordN}\bigl(m(x), v(x)\bigr)
|
|
&= \Bigl(v(x) \cdot x^W + m(x) \cdot x^D \Bigr) \cdot x^{(N-1)W} \bmod P(x) = \\
|
|
&= \Bigl(v(x) \cdot x^{W-D} + m(x) \Bigr) \cdot x^{(N-1)W + D} \bmod P(x),
|
|
\end{align*}
|
|
and may be implemented using the table-driven multiplication as described
in (\ref{e:crcwordtable}) except that the operand is multiplied by
$x^{(N-1)W+D}$ instead of $x^D$. As in (\ref{e:crcbytetable2}), explicit
multiplication of $v(x)$ by $x^{W-D}$ is not required since the $D$-normalized
representation of $v(x)$, viewed as a $W$-normalized representation, is
equal to $\left(v(x) \cdot x^{W-D}\right)$.
|
|
|
|
Using the same technique as in formula (\ref{e:crcbyte}), for $D \geq W$
|
|
let
|
|
\begin{align*}
|
|
v_H(x) &= \left\lfloor\frac{v(x)}{x^{D-W}}\right\rfloor,
|
|
& \deg\bigl(v_H(x)\bigr) &< W, \\
|
|
v_L(x) &= v(x) \bmod x^{D-W},
|
|
& \deg\bigl(v_L(x)\bigr) &< D-W,
|
|
\end{align*}
|
|
so that $v(x) = v_L(x) + v_H(x) \cdot x^{D-W}$. Then,
|
|
\begin{align}
|
|
& \mbox{CrcWordN}\bigl(m(x), v(x)\bigr) = \nonumber \\
|
|
& = \Bigl(v(x) \cdot x^W + m(x) \cdot x^D \Bigr) \cdot x^{(N-1)W} \bmod P(x) = \nonumber \\
|
|
& = \Bigl(\bigl(v_L(x) + v_H(x) \cdot x^{D-W}\bigr) \cdot x^W + m(x) \cdot x^D \Bigr) \cdot x^{(N-1)W} \bmod P(x) = \nonumber \\
|
|
& = \Bigl(v_L(x) \cdot x^W + \bigl(v_H(x) + m(x)\bigr) \cdot x^D \Bigr) \cdot x^{(N-1)W} \bmod P(x) = \nonumber \\
|
|
& = \Bigl(\bigl(v_H(x) + m(x)\bigr) \cdot x^{(N-1)W + D} \bmod P(x)\Bigr) + \nonumber \\
|
|
& + \Bigl(\bigl(v_L(x) \cdot x^W \bigr) \cdot x^{(N-1)W} \bmod P(x)\Bigr). \label{e:crcwordinterleaved}
|
|
\end{align}
|
|
|
|
Since $\deg\bigl(v_H(x)+m(x)\bigr) < W$, the first summand of
|
|
$\mbox{CrcWordN}\bigl(m(x), v(x)\bigr)$,
|
|
\begin{align*}
|
|
\Bigl(\bigl(v_H(x) + m(x)\bigr) \cdot x^{(N-1)W + D} \bmod P(x)\Bigr),
|
|
\end{align*}
|
|
may be computed using the table-driven multiplication technique described
|
|
in (\ref{e:crcwordtable}) except that the operand is multiplied by
|
|
$x^{D+(N-1)W}$ instead of $x^D$.
|
|
|
|
Computation of the second summand of $\mbox{CrcWordN}\bigl(m(x), v(x)\bigr)$,
|
|
\begin{align*}
|
|
\Bigl(\bigl(v_L(x) \cdot x^W \bigr) \cdot x^{(N-1)W} \bmod P(x)\Bigr),
|
|
\end{align*}
|
|
is somewhat less intuitive. Since $\deg\bigl(v_L(x)\bigr) < D-W$,
|
|
\begin{align*}
|
|
\left(v_L(x) \cdot x^W\right) \bmod P(x) = \left(v_L(x) \cdot x^W\right),
|
|
\end{align*}
|
|
and may be computed by shifting $v_L(x)$ by $W$ bits. Additional
|
|
multiplication by $x^{(N-1)W}$ is accomplished by adding $\bigl(v_L(x)
|
|
\cdot x^W\bigr)$, produced at step $n < N-1$ of the algorithm described by
|
|
formula (\ref{e:crcwordn}), to the value of $v_{k, n+1}(x)$ which will be
|
|
additionally multiplied by $x^{(N-1)W}$ as shown in formula
|
|
(\ref{e:crcwordnmultiply}).
|
|
|
|
For $n=N-1$, the value of $\bigl(v_L(x) \cdot x^W\bigr)$ should be added to
the value of $v_{k+1, n'}(x)$ where $n' = 0$. For $k < K$, it will be
multiplied by $x^{(N-1)W}$ during the next round of parallel computation as
shown in (\ref{e:crcwordnmultiply}). For $k = K$, $v_{k+1, n'}(x)$ will be
multiplied by $x^{(N-1)W}$ during CRC concatenation as shown in
(\ref{e:additionalmultiply}) since $n'=0$.
|
|
|
|
\begin{figure}
|
|
\begin{lstlisting}[caption={Interleaved, word by word CRC computation},label={l:CrcMultiword}]
|
|
Crc CrcInterleavedWordByWord(
|
|
Word *data, int blocks, Crc v, Crc u) {
|
|
Crc crc[N+1] = {0};
|
|
crc[0] = v ^ u;
|
|
int i = 0;  // Also used after the loop to address the last block.
for (; i < N*(blocks - 1); i += N) {
|
|
Word buffer[N];
|
|
// Load next N words and move overflow
|
|
// bits into "next" word.
|
|
for (int n = 0; n < N; ++n) {
|
|
buffer[n] = crc[n] ^ data[i + n];
|
|
if (D > sizeof(Word) * 8)
|
|
crc[n+1] ^= crc[n] >> (sizeof(Word) * 8);
|
|
crc[n] = 0;
|
|
}
|
|
// Compute interleaved word-by-word CRC.
|
|
for (int byte = 0; byte < sizeof(Word); ++byte) {
|
|
for (int n = 0; n < N; ++n) {
|
|
crc[n] ^=
|
|
MulInterleavedWordByXpowD[byte][(Byte) buffer[n]];
|
|
buffer[n] >>= 8;
|
|
}
|
|
}
|
|
// Combine crc[0] with delayed overflow bits.
|
|
crc[0] ^= crc[N];
|
|
crc[N] = 0;
|
|
}
|
|
// Process the last N words and combine CRCs.
|
|
for (int n = 0; n < N; ++n) {
|
|
if (n != 0) crc[0] ^= crc[n];
|
|
Crc WordCrc = CrcOfWord(crc[0] ^ data[i + n]);
|
|
if (D > sizeof(Word) * 8) {
|
|
crc[0] >>= D - sizeof(Word) * 8;
|
|
crc[0] ^= WordCrc;
|
|
} else {
|
|
crc[0] = WordCrc;
|
|
}
|
|
}
|
|
return (crc[0] ^ u);
|
|
}
|
|
void InitInterleavedWordTables(void) {
|
|
for (int byte = 0; byte < sizeof(Word); ++byte) {
|
|
Crc m = XpowN(D - 8 + N*sizeof(Word)*8 - 8*byte);
|
|
for (int i = 0; i < 256; ++i) {
|
|
MulInterleavedWordByXpowD[byte][i] =
|
|
MultiplyUnnormalized(i, 8, m);
|
|
}
|
|
}
|
|
}
|
|
\end{lstlisting}
|
|
\end{figure}
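
For readers who want to experiment, the following self-contained sketch
shows the same idea in the simplest setting. It is not the implementation
measured in this paper: it assumes $D = 32 \leq W = 64$ (so there are no
overflow bits), the bit-reflected CRC-32C polynomial, and $N = 4$ streams,
and it builds the tables by appending zero bytes instead of using XpowN and
MultiplyUnnormalized. All names in it are hypothetical.

\begin{lstlisting}[caption={Sketch: interleaved word-by-word CRC-32C with $D \leq W$ (illustrative assumptions)}]
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kStreams = 4;              // N, the interleave level
constexpr uint32_t kPoly = 0x82F63B78u;  // CRC-32C, bit-reflected

static uint32_t byte_table[256];
static uint32_t word_table[8][256];         // multiply by x^D
static uint32_t interleaved_table[8][256];  // multiply by x^(D+(N-1)W)

// Advances a CRC state as if `count` zero bytes were processed.
static uint32_t AppendZeroBytes(uint32_t crc, int count) {
  while (count-- > 0) crc = (crc >> 8) ^ byte_table[crc & 0xFF];
  return crc;
}

void InitInterleavedSketchTables() {
  for (uint32_t i = 0; i < 256; ++i) {
    uint32_t c = i;
    for (int bit = 0; bit < 8; ++bit)
      c = (c >> 1) ^ ((c & 1) ? kPoly : 0);
    byte_table[i] = c;
  }
  for (int j = 0; j < 8; ++j) {
    for (int i = 0; i < 256; ++i) {
      // Byte j of a word is followed by 7-j bytes of the same word...
      word_table[j][i] = AppendZeroBytes(byte_table[i], 7 - j);
      // ...and, when interleaved, by N-1 further words of zeros.
      interleaved_table[j][i] =
          AppendZeroBytes(byte_table[i], 7 - j + 8 * (kStreams - 1));
    }
  }
}

// Endian-independent load of a 64-bit word, first byte lowest.
static uint64_t LoadWord(const uint8_t* p) {
  uint64_t w = 0;
  for (int j = 0; j < 8; ++j) w |= static_cast<uint64_t>(p[j]) << (8 * j);
  return w;
}

// Table-driven multiplication of a 64-bit operand.
static uint32_t MulWord(const uint32_t (&table)[8][256], uint64_t w) {
  uint32_t r = 0;
  for (int j = 0; j < 8; ++j) {
    r ^= table[j][w & 0xFF];
    w >>= 8;
  }
  return r;
}

// Byte-by-byte reference implementation.
uint32_t Crc32cBytewise(const uint8_t* p, std::size_t n, uint32_t crc = 0) {
  crc = ~crc;
  for (std::size_t i = 0; i < n; ++i)
    crc = (crc >> 8) ^ byte_table[(crc ^ p[i]) & 0xFF];
  return ~crc;
}

// Processes `groups` groups of kStreams 64-bit words each.
uint32_t Crc32cInterleaved(const uint8_t* p, std::size_t groups,
                           uint32_t crc = 0) {
  assert(groups >= 1);
  uint32_t s[kStreams] = {0};
  s[0] = ~crc;
  // All groups but the last: kStreams independent data flows.
  for (std::size_t g = 0; g + 1 < groups; ++g, p += 8 * kStreams)
    for (int n = 0; n < kStreams; ++n)
      s[n] = MulWord(interleaved_table, LoadWord(p + 8 * n) ^ s[n]);
  // Last group: ordinary word-by-word steps that also combine the CRCs.
  uint32_t r = 0;
  for (int n = 0; n < kStreams; ++n)
    r = MulWord(word_table, LoadWord(p + 8 * n) ^ r ^ s[n]);
  return ~r;
}
\end{lstlisting}

After a call to InitInterleavedSketchTables, Crc32cInterleaved applied to
$g$ groups should return the same value as Crc32cBytewise applied to the same
$32g$ bytes. The sketch keeps the loops rolled for readability; a tuned
version would unroll them and keep the stream states in registers.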
|
|
|
|
|
|
% -----------------------------------------
|
|
\section{Experimental results}
|
|
|
|
The tests were performed using an Intel Q9650 3.0GHz CPU, DDR2-800 memory with
4-4-4-12 timing, and a motherboard with an Intel P45 chipset.
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Testing methodology}
|
|
|
|
All tests were performed using random input data over various block sizes.
The code for all evaluated algorithms was heavily optimized. Tests were
performed on both aligned and non-aligned input data to ensure that
misaligned inputs do not carry a performance penalty. CRC tables were aligned
on a 256-byte boundary.
|
|
|
|
Tests were performed with warm data and warm CRC tables: as shown in
|
|
\cite{Kounavis2005\remove{, DBLP:conf/iscc/KounavisB05,
|
|
DBLP:journals/tc/KounavisB08}}, the footprint of CRC tables -- as long as
|
|
they fit into L1 cache -- is not a major contributor to the performance.
|
|
|
|
Performance was measured in the number of CPU cycles per byte of input data:
apparently, the performance of CRC computation is bounded by the performance
of the CPU and its L1 cache latency. Spot testing of a few other Intel and AMD
CPU models showed little variation in performance measured in CPU cycles per
byte despite substantial differences in CPU clock frequencies.
|
|
|
|
To minimize performance variations caused by interference with OS and other
|
|
applications (context switches, CPU migrations, CPU cache flushes, memory
|
|
bus interference from other processes, etc.), the test applications were
|
|
run at high priority, each test was executed multiple times, and the
|
|
minimum time was measured. That allowed the tests to achieve repeatability
|
|
within $\pm 1\%$.
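
The ``run several times and keep the minimum'' part of this procedure can be
sketched as follows. This is not the harness used for the measurements in
this paper: it assumes wall-clock timing with std::chrono and converts
seconds to cycles using a nominal clock frequency supplied by the caller,
and the names MinSeconds and CyclesPerByte are hypothetical.

\begin{lstlisting}[caption={Sketch: minimum-of-runs timing (illustrative assumptions)}]
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>

// Runs fn() `runs` times and returns the shortest duration in seconds.
template <typename Fn>
double MinSeconds(Fn&& fn, int runs) {
  double best = std::numeric_limits<double>::infinity();
  for (int r = 0; r < runs; ++r) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    best = std::min(
        best, std::chrono::duration<double>(stop - start).count());
  }
  return best;
}

// Converts a duration to CPU cycles per byte.
// Assumption: the CPU runs at the nominal frequency `hz` throughout.
double CyclesPerByte(double seconds, double hz, std::size_t bytes) {
  return seconds * hz / static_cast<double>(bytes);
}
\end{lstlisting}

Taking the minimum rather than the mean discards runs disturbed by context
switches or cache flushes, which is what makes the results repeatable.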
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Compiler comparison}
|
|
|
|
Despite CRC code being rather straightforward, there were surprises (see
|
|
tables \ref{t:CompilerComparison128} and \ref{t:CompilerComparison64}).
|
|
|
|
On the 64-bit AMD64 platform, the Microsoft CL compiler (version 15.00.30729)
consistently and noticeably generated the fastest code using
general-purpose integer arithmetic (64-bit and smaller CRCs) -- 1.23 times
faster than the code generated by Intel's ICL 11.10.051 and 1.49 times
faster than the code generated by GCC 4.5.0. Tuned, hand-written inline
assembler code for CRC-32 and CRC-64 for GCC was as fast as the code
generated by CL.
|
|
|
|
When it comes to arithmetic with the use of SSE2 intrinsic functions on
the 64-bit AMD64 platform for 128-bit CRC, the code generated by GCC 4.5.0
consistently outperformed the code generated by the Intel and Microsoft
compilers -- by a factor of 1.21 and 1.33 respectively (see table
\ref{t:CompilerComparison128}). However, earlier
versions of GCC did not produce efficient SSE2 code either. For that
reason, pre-4.5.0 versions of GCC use hand-written inline assembler code
which was as fast as the code generated by GCC 4.5.0.
|
|
|
|
None of the compilers was able to generate efficient code on the 32-bit I386
platform. Performance of the code that used MMX intrinsic functions was
better but still not as good as hand-written assembler, which was provided
for all compilers.
|
|
|
|
The fastest code for 128-bit CRC on I386 platform was generated by GCC
|
|
4.5.0.
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Choice of interleave level}
|
|
|
|
The number of data streams processed by the interleaved, word-by-word CRC
computation described in section \ref{s:multiword} should matter. Too few
means underutilization of the available ALUs. Too many will increase the length
of the main loop and stress instruction decoders, and may cause spilling
of registers containing hot data (interleaved processing of $N$ words of
data uses at least $(2N+2)$ registers).
|
|
|
|
As table \ref{t:MultiwordPerfByStripe} shows, the optimal number of
interleaved data streams on modern Intel and AMD CPUs for integer
arithmetic is either 3 or 4 (likely because they all have exactly 3 ALUs).
However, for SSE2 arithmetic on the AMD64 platform the optimal number of
streams is 6 (3 on I386), which is a quite counter-intuitive result as it does
not correlate with the number of available ALUs. The good old performance
mantra ``you need to measure'' still applies.
|
|
|
|
|
|
% -----------------------------------------
|
|
\subsection{Performance of CRC algorithms}
|
|
|
|
Average performance of the best variants of the CRC algorithms for the 64-bit
AMD64 and 32-bit I386 platforms processing 1KB, 2KB, \ldots, 1MB inputs is
given in tables \ref{t:AveragePerformance64} and \ref{t:AveragePerformance32}
respectively. The proposed interleaved multiword CRC algorithm is 1.7-2.0 times
faster than the current state of the art, ``slicing''.
|
|
|
|
As demonstrated in tables \ref{t:CRC64Perf} and \ref{t:CRC32Perf},
|
|
interleaved word-by-word CRC described in section \ref{s:multiword},
|
|
running at 1.2 CPU cycles/byte, is 1.8 times faster than 2.1 CPU
|
|
cycles/byte achieved by current state of the art word-by-word CRC algorithm
|
|
(``slicing") described in \cite{Kounavis2005\remove{,
|
|
DBLP:conf/iscc/KounavisB05, DBLP:journals/tc/KounavisB08}}.
|
|
|
|
On 64-bit AMD64 platform, the best performance was achieved using 64-bit
|
|
reads and 64-bit tables for all variants of $N$-bit CRC for $N \leq 64$. In
|
|
particular, tables \ref{t:CRC64Perf} and \ref{t:CRC32Perf} clearly show
|
|
that performance of 32-bit and 64-bit CRCs is nearly identical.
|
|
Consequently, there is no reason to favor CRC-32 over CRC-64 for
|
|
performance reasons.
|
|
|
|
The use of MMX on the 32-bit I386 platform made it possible to use 64-bit
tables and 64-bit reads, achieving 1.3 CPU cycles/byte. None of the compilers
generated efficient code using MMX intrinsic functions, so inline assembler
was used.
|
|
|
|
With the use of SSE2 intrinsics on the AMD64 architecture, 128-bit CRC may be
computed at 1.7 CPU cycles/byte using the new algorithm (see
table \ref{t:CRC128PerfMultiword}), compared with 2.9 CPU cycles/byte
achieved by word-by-word CRC computation (see table
\ref{t:CRC128PerfSlicing}). On the 32-bit I386 architecture, the use of SSE2
intrinsics and GCC 4.5.0 allowed the computation of 128-bit CRC at 2.1 CPU
cycles/byte, compared with 4.2 CPU cycles/byte delivered by the
word-by-word algorithm.
|
|
|
|
Given that MD5 computation takes 6.8-7.1 CPU cycles/byte and SHA-1 takes
|
|
7.6-7.9 CPU cycles per byte, CRCs are still the algorithm of choice for
|
|
data corruption detection.
|
|
|
|
|
|
|
|
% -----------------------------------------
|
|
\bibliographystyle{alpha}
|
|
\bibliography{crc}
|
|
|
|
% -----------------------------------------
|
|
\appendix
|
|
\cleardoublepage
|
|
|
|
|
|
|
|
% -----------------------------------------
|
|
\begin{table}
|
|
\begin{center}
|
|
|
|
\caption{CRC performance, AMD64 platform} \label{t:AveragePerformance64}
|
|
\begin{tabular}{| l | c | c | c |}
|
|
\hline
|
|
Method & Slicing$^1$ & Multiword$^2$ & Improvement \\
|
|
\hline
|
|
CRC-32 & $2.08^3$ & $1.16^{4,5}$ & 1.79 \\
|
|
CRC-64 & $2.09^3$ & $1.16^{4,5}$ & 1.79 \\
|
|
CRC-128 & $2.91^4$ & $1.68^{4,6}$ & 1.73 \\
|
|
\hline
|
|
\end{tabular}
|
|
{}
|
|
|
|
\caption{CRC performance, I386 platform} \label{t:AveragePerformance32}
|
|
\begin{tabular}{| l | c | c | c |}
|
|
\hline
|
|
Method & Slicing$^1$ & Multiword$^2$ & Improvement \\
|
|
\hline
|
|
CRC-32 & $2.52^3$ & $1.29^{3,7}$ & 1.96 \\
|
|
CRC-64 & $3.28^3$ & $1.29^{3,7}$ & 2.55 \\
|
|
CRC-128 & $4.17^4$ & $2.10^{4,8}$ & 1.98 \\
|
|
\hline
|
|
\end{tabular}
|
|
{}
|
|
\end{center}
|
|
|
|
|
|
|
|
The best average number of CPU cycles per byte processing 1KB-1MB inputs.
|
|
Warm data, warm tables.
|
|
|
|
$^1$ {\it``Slicing"} implements the algorithm described in section
|
|
\ref{s:crcword}.
|
|
|
|
$^2$ {\it``Multiword/$N$"} implements algorithm described in section
|
|
\ref{s:multiword} processing $N$ data streams in parallel in interleaved
|
|
manner.
|
|
|
|
$^3$ Microsoft CL 15.00.30729 compiler, ``-O2" flag.
|
|
|
|
$^4$ GCC 4.5.0 compiler, ``-O3" flag.
|
|
|
|
$^5$ Multiword/$N=4$, hand-written inline assembler.
|
|
|
|
$^6$ Multiword/$N=6$, C++.
|
|
|
|
$^7$ Multiword/$N=4$, hand-written MMX inline assembler.
|
|
|
|
$^8$ Multiword/$N=3$, C++.
|
|
|
|
\end{table}
|
|
|
|
|
|
|
|
% --------------------------------------
|
|
\begin{table}
|
|
\begin{center}
|
|
\caption{Interleaved multiword CRC: choosing the number of stripes $N$}
|
|
|
|
\label{t:MultiwordPerfByStripe}
|
|
\begin{tabular}{| l | l | c | c | c | c | c | c | c |}
|
|
\hline
|
|
CRC & Platform & N=2 & N=3 & N=4 & N=5 & N=6 & N=7 & N=8 \\
|
|
\hline
|
|
CRC-64$^9$ & AMD64 & 1.42 & 1.23 & {\bf 1.17} & 1.46 & 2.08 & 2.59 & 2.73 \\
|
|
CRC-128$^{10}$ & AMD64 & 2.07 & 1.84 & 1.76 & 1.70 & {\bf 1.68} & 1.75 & 1.79 \\
|
|
CRC-128$^{10}$ & I386 & 2.56 & {\bf 2.10} & 2.46 & 2.61 & 2.52 & 2.62 & 2.57 \\
|
|
\hline
|
|
\end{tabular}
|
|
\end{center}
|
|
{}
|
|
|
|
Average number of CPU cycles per byte processing 1KB, 2KB, \ldots, 1MB
|
|
inputs. Interleaved word-by-word CRC computation as described in section
|
|
\ref{s:multiword}. Warm data, warm tables.
|
|
|
|
$^9$ Microsoft CL 15.00.30729 compiler, AMD64 platform, C++ code.
|
|
|
|
$^{10}$ GCC 4.5.0 compiler, AMD64 platform, C++ code.
|
|
|
|
\end{table}
|
|
|
|
|
|
|
|
% -----------------------------------------
|
|
\begin{table}
|
|
\begin{center}
|
|
|
|
\caption{Compiler comparison: Multiword/N, 64-bit CRC} \label{t:CompilerComparison64}
|
|
\begin{tabular}{| l | c | c | c | c | c | c | c | c | c |}
|
|
\hline
|
|
Input size & N & 64 & 256 & 1K & 4K & 16K & 64K & 256K & 1M \\
|
|
\hline
|
|
GCC/C++ & 3 & 2.11 & 1.84 & 1.76 & 1.74 & 1.75 & 1.75 & 1.75 & 1.76 \\
|
|
ICL & 3 & 2.35 & 1.65 & 1.48 & 1.44 & 1.44 & 1.45 & 1.45 & 1.45 \\
|
|
CL & 4 & 1.75 & 1.29 & 1.18 & 1.15 & 1.17 & 1.18 & 1.18 & 1.18 \\
|
|
GCC/ASM & 4 & 1.65 & 1.26 & 1.17 & 1.15 & 1.16 & 1.17 & 1.17 & 1.17 \\
|
|
\hline
|
|
\end{tabular}
|
|
{}
|
|
|
|
\caption{Compiler comparison: Multiword/N, 128-bit CRC} \label{t:CompilerComparison128}
|
|
\begin{tabular}{| l | c | c | c | c | c | c | c | c | c |}
|
|
\hline
|
|
Input size & N & 64 & 256 & 1K & 4K & 16K & 64K & 256K & 1M \\
|
|
\hline
|
|
CL & 5 & 4.08 & 2.56 & 2.43 & 2.25 & 2.20 & 2.19 & 2.18 & 2.20 \\
|
|
ICL & 5 & 3.52 & 2.33 & 2.23 & 2.05 & 2.00 & 1.99 & 1.99 & 2.01 \\
|
|
GCC & 6 & 2.90 & 1.93 & 1.85 & 1.72 & 1.65 & 1.63 & 1.63 & 1.63 \\
|
|
\hline
|
|
\end{tabular}
|
|
{}
|
|
\end{center}
|
|
|
|
Number of CPU cycles per byte, best code for given compiler and CRC.
|
|
|
|
64-bit CRC (CRC-64-ECMA-182 polynomial) and 128-bit CRC (CRC-128/IEEE
|
|
polynomial) respectively. 64-bit platform, 64-bit reads. Warm data, warm
|
|
tables.
|
|
|
|
Microsoft CL 15.00.30729 compiler was used with ``-O2" flag. Intel ICL
|
|
11.10.051 and GCC 4.5.0 were used with ``-O3" flag.
|
|
|
|
\begin{center}
|
|
\includegraphics[trim=14.25mm 50mm 16.75mm 50mm, width=0.99\textwidth]{CompilerComparison.pdf} \label{f:CompilerComparison}
|
|
\end{center}
|
|
|
|
{\it``Multiword/$N$"} implements algorithm described in section
|
|
\ref{s:multiword} processing $N$ data streams in parallel in interleaved
|
|
manner.
|
|
|
|
|
|
\end{table}
|
|
|
|
|
|
% --------------------------------------
|
|
\begin{table}
|
|
\begin{center}
|
|
|
|
\caption{CRC-32 performance} \label{t:CRC32Perf}
|
|
\begin{tabular}{| l | c | c | c | c | c | c | c | c |}
|
|
\hline
|
|
Input size & 64 & 256 & 1K & 4K & 16K & 64K & 256K & 1M \\
|
|
\hline
|
|
Sarwate & 6.61 & 6.62 & 6.70 & 6.68 & 6.67 & 6.66 & 6.67 & 6.75 \\
|
|
Black & 5.44 & 5.46 & 5.47 & 5.48 & 5.47 & 5.46 & 5.47 & 5.53 \\
|
|
Slicing & 2.15 & 2.10 & 2.09 & 2.09 & 2.08 & 2.08 & 2.08 & 2.10 \\
|
|
Blockword/3 & 2.27 & 2.14 & 2.15 & 2.13 & 2.13 & 1.55 & 1.39 & 1.31 \\
|
|
Multiword/4 & 1.75 & 1.29 & 1.18 & 1.16 & 1.17 & 1.18 & 1.18 & 1.18 \\
|
|
\hline
|
|
\end{tabular}
|
|
\end{center}
|
|
|
|
Number of CPU cycles per byte. 32-bit CRC (CRC-32C polynomial), 64-bit
|
|
platform, 64-bit tables, 64-bit reads (except Sarwate). Microsoft CL
|
|
15.00.30729 compiler. Warm data, warm tables.
|
|
|
|
\begin{center}
|
|
\includegraphics[trim=14.25mm 50mm 16.75mm 50mm, width=0.99\textwidth]{CRC32-full.pdf} \label{f:CRC32Perf}
|
|
\end{center}
|
|
|
|
{\it``Sarwate"} implements the algorithm described in section
|
|
\ref{s:crcbyte}.
|
|
|
|
{\it``Black"} implements the algorithm described in section
|
|
\ref{s:crcbyteword}.
|
|
|
|
{\it``Slicing"} implements the algorithm described in section
|
|
\ref{s:crcword}.
|
|
|
|
{\it``Blockword/3"} implements the algorithm described in section
|
|
\ref{s:blockword} with 3 stripes of 15,376 bytes each.
|
|
|
|
{\it``Multiword/4"} implements the algorithm described in section
|
|
\ref{s:multiword} processing 4 data streams in parallel in interleaved
|
|
manner.
|
|
|
|
|
|
\end{table}
|
|
|
|
|
|
% --------------------------------------
|
|
\begin{table}
|
|
\begin{center}
|
|
|
|
\caption{CRC-64 performance} \label{t:CRC64Perf}
|
|
\begin{tabular}{| l | c | c | c | c | c | c | c | c |}
|
|
\hline
|
|
Input size & 64 & 256 & 1K & 4K & 16K & 64K & 256K & 1M \\
|
|
\hline
|
|
Sarwate & 6.61 & 6.62 & 6.70 & 6.68 & 6.67 & 6.65 & 6.66 & 6.75 \\
|
|
Black & 5.44 & 5.46 & 5.47 & 5.47 & 5.47 & 5.47 & 5.47 & 5.53 \\
|
|
Slicing & 2.16 & 2.08 & 2.09 & 2.10 & 2.08 & 2.08 & 2.08 & 2.09 \\
|
|
Blockword/3 & 2.27 & 2.14 & 2.15 & 2.13 & 2.13 & 1.59 & 1.41 & 1.33 \\
|
|
Multiword/4 & 1.75 & 1.29 & 1.18 & 1.15 & 1.17 & 1.18 & 1.18 & 1.18 \\
|
|
\hline
|
|
\end{tabular}
|
|
\end{center}
|
|
|
|
Number of CPU cycles per byte. 64-bit CRC (CRC-64-ECMA-182 polynomial),
|
|
64-bit platform, 64-bit tables, 64-bit reads (except Sarwate). Microsoft CL
|
|
15.00.30729 compiler. Warm data, warm tables.
|
|
|
|
\begin{center}
|
|
\includegraphics[trim=14.25mm 50mm 16.75mm 50mm, width=0.99\textwidth]{CRC64-small.pdf} \label{f:CRC64Perf}
|
|
\end{center}
|
|
|
|
{\it``Sarwate"} implements the algorithm described in section
|
|
\ref{s:crcbyte}.
|
|
|
|
{\it``Black"} implements the algorithm described in section
|
|
\ref{s:crcbyteword}.
|
|
|
|
{\it``Slicing"} implements the algorithm described in section
|
|
\ref{s:crcword}.
|
|
|
|
{\it``Blockword/3"} implements the algorithm described in section
|
|
\ref{s:blockword} with 3 stripes of 15,376 bytes each.
|
|
|
|
{\it``Multiword/4"} implements the algorithm described in section
|
|
\ref{s:multiword} processing 4 data streams in parallel in interleaved
|
|
manner.
|
|
|
|
|
|
\end{table}
|
|
|
|
% --------------------------------------
|
|
\begin{table}
|
|
\begin{center}
|
|
|
|
|
|
\caption{CRC-128 performance: Slicing CRC} \label{t:CRC128PerfSlicing}
|
|
\begin{tabular}{| l | c | c | c | c | c | c | c | c |}
|
|
\hline
|
|
Input size & 64 & 256 & 1K & 4K & 16K & 64K & 256K & 1M \\
|
|
\hline
|
|
CL/SSE2 & 4.02 & 3.81 & 4.01 & 4.05 & 4.13 & 4.18 & 4.20 & 4.24 \\
|
|
ICL/SSE2 & 3.40 & 3.24 & 3.57 & 3.59 & 3.68 & 3.72 & 3.75 & 3.81 \\
|
|
GCC/UINT & 3.45 & 3.24 & 3.36 & 3.48 & 3.61 & 3.64 & 3.67 & 3.72 \\
|
|
GCC/SSE2 & 2.67 & 2.48 & 2.63 & 2.79 & 2.97 & 2.99 & 2.99 & 3.03 \\
|
|
\hline
|
|
\end{tabular}
|
|
|
|
|
|
\caption{CRC-128 performance: Multiword CRC} \label{t:CRC128PerfMultiword}
|
|
\begin{tabular}{| l | c | c | c | c | c | c | c | c |}
|
|
\hline
|
|
Input size & 64 & 256 & 1K & 4K & 16K & 64K & 256K & 1M \\
|
|
\hline
|
|
GCC/UINT/3 & 3.83 & 3.02 & 3.04 & 3.01 & 3.00 & 2.98 & 2.98 & 3.00 \\
|
|
CL/SSE2/5 & 4.08 & 2.56 & 2.43 & 2.25 & 2.20 & 2.19 & 2.18 & 2.20 \\
|
|
ICL/SSE2/5 & 3.52 & 2.33 & 2.23 & 2.05 & 2.00 & 1.99 & 1.99 & 2.01 \\
|
|
GCC/SSE2/6 & 2.90 & 1.93 & 1.85 & 1.72 & 1.65 & 1.63 & 1.63 & 1.63 \\
|
|
\hline
|
|
\end{tabular}
|
|
\end{center}
|
|
|
|
Number of CPU cycles per byte. 128-bit CRC (CRC-128/IEEE polynomial),
|
|
64-bit platform, 128-bit tables, 64-bit reads. Warm data, warm tables.
|
|
|
|
All compilers were tested using SSE2 intrinsics (/SSE2 variants). GCC was
|
|
also tested using 128-bit integers provided by the compiler (GCC/UINT).
|
|
|
|
\begin{center}
|
|
\includegraphics[trim=14.25mm 50mm 16.75mm 50mm, width=0.99\textwidth]{CRC128-full.pdf} \label{f:CRC128Perf}
|
|
\end{center}
|
|
|
|
{\it``Slicing"} implements algorithm described in section \ref{s:crcword}.
|
|
|
|
{\it``Multiword/$N$"} implements algorithm described in section
|
|
\ref{s:multiword} processing $N$ data streams in parallel in interleaved
|
|
manner. The optimal (for given compiler) value of $N$ was used.
|
|
|
|
\end{table}
|
|
|
|
\end{document}
|