Source data:
- ~500,000 folders (court cases)
- ~2.5-3 million documents
- Source drives are replicated x2 with RAID
- Copying to a NAS over gigabit Ethernet
- The initial, un-tuned copy was on track to take ~2 weeks (and that was after switching to Robocopy; before, it was painful just to run an ls)
- Final copy took ~24 hours
Monitoring:
- Initially DD-WRT showed only 20-40 Kbps of traffic, clearly too low. After some changes it is still generally low, but with spikes up to 650 Kbps.
- CPU use – 4/8 cores in use, even with >8 threads assigned to Robocopy
- In Computer Management -> Performance Monitor, the disk being copied from is reading as fast as it can (its activity counter is pegged at 100 the whole time)
- The "Split IO/Sec" counter is very high much of the time. Research indicates this could be improved by defragmenting (though that might take me months to complete). A sketch for sampling this counter from the command line follows this list.
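As an aside, here is a minimal sketch of how that counter can be sampled outside the Performance Monitor UI, by shelling out to typeperf (which ships with Windows) from Python. The "_Total" instance and the sample count/interval are assumptions, not settings I actually used.

    # Sketch: sample the "Split IO/Sec" PerfMon counter via typeperf.
    # Assumes Windows with typeperf on PATH; "_Total" aggregates all disks.
    import subprocess

    def sample_split_io(samples: int = 10, interval_sec: int = 1) -> str:
        counter = r"\PhysicalDisk(_Total)\Split IO/Sec"
        # -si = seconds between samples, -sc = number of samples to collect
        result = subprocess.run(
            ["typeperf", counter, "-si", str(interval_sec), "-sc", str(samples)],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    if __name__ == "__main__":
        print(sample_split_io())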
Filesystem Lessons:
- NTFS can hold a folder with a very large number of files, but enumerating such a folder takes forever
- When you enumerate a directory in NTFS (e.g. by opening it in Windows Explorer), Windows appears to lock the folder(!), which pauses any in-progress copy/ls operations
- The copy does not appear to be CPU bound: even with Robocopy set to use many threads, only 4 of 8 cores are in use, at 5-15% each.
- ext4 (the destination filesystem) supports up to 64,000 items per folder; any more and you get an error.
- I split all 500k case folders into 256*256 buckets, choosing each folder's bucket from the md5 of its name (for instance, opening \36\0f might show a half dozen items). This basically uses the filesystem as a tree map; a sketch of the scheme follows this list.
- One nice consequence of this is that you can estimate how far along the copy is by counting how many of the 256 top-level buckets have been copied (85/256 -> 33%, etc)
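To make the bucketing concrete, here is a minimal sketch of the scheme, assuming the bucket path is simply the first two hex bytes of md5(folder name) and that progress is estimated by counting the top-level buckets already present at the destination. The function names and paths are illustrative, not the actual script I used.

    # Sketch: hash-bucket ~500k case folders into a 256*256 directory tree.
    # The two-level path is the first two hex bytes of md5(folder name),
    # e.g. a digest starting "360f..." maps to 36\0f.
    import hashlib
    import os

    def bucket_path(folder_name: str) -> str:
        digest = hashlib.md5(folder_name.encode("utf-8")).hexdigest()
        return os.path.join(digest[0:2], digest[2:4])

    def estimate_progress(dest_root: str) -> float:
        # Count how many of the 256 top-level buckets exist so far;
        # 85/256 present -> roughly 33% done, as noted above.
        done = sum(
            1 for name in os.listdir(dest_root)
            if os.path.isdir(os.path.join(dest_root, name))
        )
        return done / 256

    # Example usage (hypothetical folder name):
    #   bucket_path("smith-v-jones-2011")  # -> a two-level path like "ab\\cd"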
Robocopy Options:
- Robocopy can redirect its logging to a file instead of the console with /LOG:output.txt
- Robocopy lets you set the number of threads it uses with /MT:n; the default is 8. It seemed to run faster with more than 8 threads, but only the first few extra threads made any difference. An example invocation follows this list.
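For concreteness, here is a sketch of the kind of invocation these options add up to, wrapped in Python via subprocess. /E, /MT:n, and /LOG are standard Robocopy switches; the source/destination paths and the choice of 16 threads are placeholders, not the exact command I ran.

    # Sketch: run Robocopy with an explicit thread count and file logging.
    # /E copies subdirectories (including empty ones), /MT:16 uses 16 threads
    # (the default is 8), /LOG writes output to a file instead of the console.
    import subprocess

    cmd = [
        "robocopy",
        r"D:\cases",             # placeholder source
        r"\\nas\archive\cases",  # placeholder destination
        "/E",
        "/MT:16",
        "/LOG:output.txt",
    ]

    # Robocopy exit codes below 8 indicate success (with flags describing
    # what was copied), so only treat 8 and above as failures.
    rc = subprocess.run(cmd).returncode
    if rc >= 8:
        raise RuntimeError(f"Robocopy failed with exit code {rc}")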
To investigate:
- Ways of using virtual filesystems: it would be nice to keep using wget to download, but have large folders split up into batches for scraping.
- One possibility is to run wget inside VirtualBox, since there are more Linux-based virtual filesystems; not sure about the performance overhead.