Slow performance of AddEntry method for big archives (more than 200K) #195
Originally created by @KvanTTT on GitHub (Jun 4, 2017).
Is it possible to get rid of the DoesKeyMatchExisting check for the AddAllFromDirectory method? It leads to non-linear time complexity.
@adamhathcock commented on GitHub (Jun 4, 2017):
Might be possible to refactor how to check for duplicates but I don't see how this is a real performance problem relative to the act of actually writing/reading archives.
@KvanTTT commented on GitHub (Jun 4, 2017):
Maybe I'm wrong about the performance decrease. It needs further investigation, but 7zip compresses big archives much faster.
@adamhathcock commented on GitHub (Jun 4, 2017):
It probably would always be faster as it's a native binary. Try ZipWriter directly if you want something faster.
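For reference, a minimal sketch of writing a zip with the writer API directly, along the lines of the SharpCompress README (exact option types can differ between versions, so treat the call shapes as an assumption):

```csharp
using System;
using System.IO;
using SharpCompress.Common;
using SharpCompress.Writers;

class WriteZipDirectly
{
    static void Main()
    {
        // Stream entries straight into the zip file; the writer keeps no
        // in-memory entry list, so there is no per-entry duplicate scan.
        using (Stream stream = File.OpenWrite("archive.zip"))
        using (var writer = WriterFactory.Open(stream, ArchiveType.Zip,
                   new WriterOptions(CompressionType.Deflate)))
        {
            writer.WriteAll("some-directory", "*", SearchOption.AllDirectories);
        }
    }
}
```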
@KvanTTT commented on GitHub (Jul 3, 2017):
Thanks, ZipWriter works much faster!
@iCodeIT commented on GitHub (Oct 3, 2019):
@adamhathcock the problem with using List<> collections is that searching through them is slow, and when you try to make archives that contain 200,000 files it gets very slow. It's O(n^2) performance. We waited more than 10 minutes because of this, at which point we killed the program.
If you used some sort of hash collection instead (a dictionary comes to mind) you would get O(1) behavior, and that part of the program would run in seconds.
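A self-contained sketch (not SharpCompress code) of the difference being described: a per-add List scan makes n adds O(n^2) overall, while a hash lookup keeps them O(n):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class DuplicateCheckComparison
{
    static void Main()
    {
        const int n = 50_000; // raise toward 200,000 to see the quadratic blow-up

        var sw = Stopwatch.StartNew();
        var list = new List<string>();
        for (int i = 0; i < n; i++)
        {
            string key = "file" + i;
            if (!list.Contains(key))   // scans every existing key: O(n) per add
                list.Add(key);
        }
        Console.WriteLine($"List scan:  {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        var keys = new Dictionary<string, int>();
        for (int i = 0; i < n; i++)
        {
            string key = "file" + i;
            if (!keys.ContainsKey(key)) // hash lookup: O(1) on average
                keys.Add(key, i);
        }
        Console.WriteLine($"Dictionary: {sw.ElapsedMilliseconds} ms");
    }
}
```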
@adamhathcock commented on GitHub (Oct 3, 2019):
Lists are O(n) while Dictionaries would be O(log n) and then you have to be indexing/hashing by a certain value. I’m assuming you mean the Entry Key is what you want to search by.
I’m afraid you’ll need to be more specific with your scenario and possibly some code. A cached list of 200k strings still shouldn’t take ten minutes for a single match.
@adamhathcock commented on GitHub (Oct 3, 2019):
Rereading the original issue and thinking about it again: trying to create an archive with a very large number of entries like this would indeed be slow.
A simple PR to create a hashset of the keys would be doable for creation.
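Something along these lines, presumably; the class and member names here are hypothetical, not the actual SharpCompress internals:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the proposed change: keep a HashSet of entry keys
// alongside the entry list, so the duplicate check no longer walks the list.
class EntryKeyIndex
{
    private readonly HashSet<string> _keys =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Returns false if the key is already present; O(1) on average,
    // replacing an O(n) scan over all existing entries.
    public bool TryAddKey(string key) => _keys.Add(key);

    public bool ContainsKey(string key) => _keys.Contains(key);
}
```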
@iCodeIT commented on GitHub (Oct 3, 2019):
So the problem is that each file calls AddEntry, which in turn checks whether the entry already exists.
The problem is not checking one entry; the problem is that each entry is checked against all existing entries, and that is slow. Furthermore, RebuildModifiedCollection is called for each entry, which in turn calls AddRange, which is also rather slow.
I will see if I have time to look at this and if successful I will create a pull request.
BTW: Instead of searching the list for a specific element, with a dictionary you can use ContainsKey, which is an O(1) operation: http://people.cs.aau.dk/~normark/oop-csharp/html/notes/collections-note-time-complexity-dictionaries.html
So for adding n entries, the total cost will be O(n).
@iCodeIT commented on GitHub (Oct 3, 2019):
It actually seems like it is RebuildModifiedCollection that takes the most time, as it is called for each AddEntry call. So the AddRange will add 1, then 2, then 3, etc.
For 200,000 entries that is a total of 1 + 2 + 3 + ... + 199,999 + 200,000 items added (which is 20,000,100,000), and that is a problem.
I will focus on writing a fix for that instead.
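A short illustration of the arithmetic (plain C#, not SharpCompress code): rebuilding the merged collection after every add copies 1 + 2 + ... + n = n(n+1)/2 items in total, versus n items once when the rebuild is deferred to the end.

```csharp
using System;

class RebuildCost
{
    static void Main()
    {
        long n = 200_000;

        // Rebuild after every AddEntry: the k-th rebuild copies k items.
        long perAddRebuild = n * (n + 1) / 2;   // 20,000,100,000 copies for n = 200,000

        // Rebuild once after all entries are added.
        long deferredRebuild = n;               // 200,000 copies

        Console.WriteLine($"Rebuild per add: {perAddRebuild} item copies");
        Console.WriteLine($"Deferred:        {deferredRebuild} item copies");
    }
}
```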
@iCodeIT commented on GitHub (Oct 3, 2019):
Did a test with 45,995 files. With the profiler running, the current implementation took 101,141 ms to add all entries. With a minor modification it takes 4,230 ms, and all unit tests still pass.

Will clean up code and make a PR later today.
[Attached profiler screenshot showing the difference in timing.]
@iCodeIT commented on GitHub (Oct 6, 2019):
Added https://github.com/adamhathcock/sharpcompress/pull/484