How to order huge (GB sized) CSV file?
Background
I have a huge CSV file that has several million rows. Each row has a timestamp I can use to order it.
Naive approach
So, my first approach was obviously to read the entire file into memory and sort it there. As you may guess, that didn't work out so well...
Naive approach v2
My second try was to follow a bit the idea behind MapReduce.
So, I would slice this huge file into several parts and sort each one. Then I would combine all the parts into the final file.
The issue here is that part B may have a row that should be in part A. So in the end, even though each part is sorted, I cannot guarantee the order of the final file...
Objective
My objective is to create a function that, given this huge unordered CSV file, produces an ordered CSV file with the same information.
Question
What are the popular solutions/algorithms for sorting data sets this big?
asked May 22, 2018 at 15:53 by Flame_Phoenix, edited May 22, 2018 at 16:01 by bfontaine
Comments:
- Using only JS or Node? Do you have a database? – Jacob H
- Why did loading the whole file into memory not work? Too much memory usage, or too slow to sort? – juvian
- For data sets too large to fit in memory, the usual approach is a disk-based system so that most of the data stays on disk, e.g. a database. The alternative is to design your own disk-based system, which will likely be both a lot more work and a lot slower than a professionally built and maintained database. You could create a table in the database indexed by your desired key, insert all the records, then open a cursor sorted by that key and iterate through all the records, outputting to the final file. – jfriend00
- https://en.wikipedia.org/wiki/External_sorting – juvian
- If you're on Linux or Unix, try using the sort command. – Matt Timmermans
2 Answers
What are the popular solutions/algorithms for sorting data sets this big?
Since you've already concluded that the data is too large to sort/manipulate in the memory you have available, the popular solution is a database, which builds disk-based structures for managing and sorting more data than can fit in memory.
You can either build your own disk-based scheme or you can grab one that is already fully developed, optimized and maintained (e.g. a popular database). The "popular" solution that you asked about would be to use a database for managing/sorting large data sets. That's exactly what they're built for.
Database
You could set up a table that was indexed by your sort key, insert all the records into the database, then create a cursor sorted by your key and iterate the cursor, writing the now sorted records to your new file one at a time. Then, delete the database when done.
Chunked Memory Sort, Manual Merge
Alternatively, you could do a chunked sort: break the data into smaller pieces that fit in memory, sort each piece, and write each sorted block to disk. Then merge all the blocks: read the next record from each block into memory, find the lowest one among all the blocks, write it to your final output file, read the next record from that block, and repeat. Using this scheme, the merge only ever needs N records in memory at a time, where N is the number of sorted chunks you have (likely far fewer records than the original chunked block sort held).
As juvian mentioned, here's an overview of how an "external sort" like this works: https://en.wikipedia.org/wiki/External_sorting.
One key aspect of the chunked memory sort is determining how big to make the chunks. There are a number of strategies. The simplest may be to just decide how many records you can reliably fit and sort in memory based on a few simple tests, or even just a guess that you're sure is safe (picking a smaller number to process at a time just means you will split the data across more files). Then, read that many records into memory, sort them, and write them out to a known filename. Repeat that process until all the records have been read and now reside in temp files with known filenames on disk.
Then, open each file, read the first record from each one, find the lowest of the records you read in, write it out to your final file, read the next record from that file, and repeat the process. When you get to the end of a file, just remove it from the list of data you're comparing, since it's now done. When there is no more data, you're done.
Sort Keys only in Memory
If all the sort keys themselves would fit in memory, but not the associated data, then you could make and sort your own index. There are many different ways to do that, but here's one scheme.
Read through the entire original data, capturing two things in memory for every record: the sort key and the file offset in the original file where that record is stored. Then, once you have all the sort keys in memory, sort them. Then, iterate through the sorted keys one by one, seeking to the right spot in the file, reading that record, writing it to the output file, and advancing to the next key, repeating until the data for every key has been written in order.
BTree Key Sort
If all the sort keys won't fit in memory, then you can get a disk-based BTree library that will let you sort things larger than can be in memory. You'd use the same scheme as above, but you'd be putting the sort key and file offset into a BTree.
Of course, it's only one step further to put the actual data itself from the file into the BTree and then you have a database.
I would read the entire file row by row and output each line into a temporary folder, grouping lines into files by a reasonable time interval (whether the interval should be a year, a day, an hour, etc. is for you to decide based on your data). The temporary folder would then contain one file per interval (for example, with a per-day split: 2018-05-20.tmp, 2018-05-21.tmp, 2018-05-22.tmp, etc.). Now we can read the files in order, sort each in memory, and output into the target sorted file.