Source data:
- ~500,000 folders (court cases)
- ~2.5-3 million documents
- Source drive is replicated x2 with RAID
- Copying to a NAS over gigabit Ethernet
- The initial un-tuned copy was on track to take ~2 weeks (and that was after switching to Robocopy – before, it was painful just to run an ls)
- Final copy took ~24 hours
Monitoring:
- Initially I saw 20-40 Kbps of traffic in DD-WRT, clearly too low. After some changes, throughput is still generally low, but with spikes up to 650 Kbps.
- CPU use – 4/8 cores in use, even with >8 threads assigned to Robocopy
- In Computer Management -> Performance monitoring, the disk being copied from is reading as fast as it can (pegged at 100 the whole time)
- The counter called “Split IO/sec” is very high much of the time. Research suggests a defrag would improve this (though a defrag might take months to complete).
Filesystem Lessons:
- NTFS can hold folders with large numbers of files, but enumerating them takes forever
- When you enumerate a directory in NTFS (e.g. by opening it in Windows Explorer), Windows appears to lock the folder(!) which pauses any copy/ls operations
- The copy does not appear to be CPU bound – even with Robocopy set to use many threads, only 4/8 cores are in use, at 5-15% each.
- ext4 (the destination filesystem) supports up to ~64,000 items per folder – any more and you get an error.
- I split all 500k folders into 256*256 buckets (for instance, opening \36\0f reveals a half dozen items). The bucket is derived from the md5 of each folder name – basically this uses the filesystem as a hash-based tree map.
- One nice consequence of this is that you can estimate how far along the process is by counting how many top-level buckets have been copied (85/256 -> 33%, etc.)
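The bucketing and progress-estimation scheme above can be sketched in Python. The `bucket_for` and `progress_estimate` helper names are illustrative, not from the original scripts:

```python
import hashlib

def bucket_for(folder_name: str) -> str:
    """Map a case-folder name to a two-level bucket like '36/0f'
    using the first four hex digits of its MD5 (256*256 buckets)."""
    digest = hashlib.md5(folder_name.encode("utf-8")).hexdigest()
    return f"{digest[0:2]}/{digest[2:4]}"

def progress_estimate(top_level_done: int, total_buckets: int = 256) -> float:
    """MD5 spreads names uniformly, so the fraction of completed
    top-level buckets approximates overall progress (85/256 -> ~33%)."""
    return 100.0 * top_level_done / total_buckets

print(bucket_for("case_0001"))       # a two-level path, e.g. '36/0f'
print(round(progress_estimate(85)))  # -> 33
```

Because the hash is uniform, each bucket ends up with roughly 500,000 / 65,536 ≈ 8 folders, well under ext4's per-folder limit.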
Robocopy Options:
- Robocopy lets you redirect the console logging to a file with /LOG:output.txt
- Robocopy lets you set the number of threads it uses with /MT. By default this is 8; it seemed to run faster with more than 8, but only the first few extra threads made any difference.
To investigate:
- Ways of using virtual filesystems – it’d be nice to continue using wget to download, but split up large folders into batches for scraping.
- One possibility is to run wget inside VirtualBox, since there are more Linux-based virtual filesystems – not sure about the performance overhead