How to order huge (GB sized) CSV file?
Background
I have a huge CSV file that has several million rows. Each row has a timestamp I can use to order it.
Naive approach
So, my first approach was obviously to read the entire file into memory and sort it there. As you may guess, that didn't work out so well...
Naive approach v2
My second try was to follow a bit the idea behind MapReduce.
So, I would slice this huge file into several parts and sort each one. Then I would combine all the parts into the final file.
The issue here is that part B may have a row that should be in part A. So in the end, even though each part is sorted, I cannot guarantee the order of the final file...
Objective
My objective is to create a function that, given this huge unordered CSV file, produces an ordered CSV file with the same information.
Question
What are the popular solutions/algorithms for sorting data sets this big?
asked May 22, 2018 at 15:53 by Flame_Phoenix, edited May 22, 2018 at 16:01 by bfontaine
Comments:
- Using only JS or Node? Do you have a database? – Jacob H
- Why did loading the whole file into memory not work? Too much memory usage, or too slow to sort? – juvian
- For data sets too large to fit in memory, the usual approach is a disk-based system so that most of the data stays on disk, e.g. a database. The alternative is to design your own disk-based system, which will likely be both a lot more work and a lot slower than a professionally built and maintained database. You could create a table in the database indexed by your desired key, insert all the records, then open a cursor sorted by that key and iterate through all the records, outputting to the final file. – jfriend00
- https://en.wikipedia.org/wiki/External_sorting – juvian
- If you're on Linux or Unix, try using the sort command. – Matt Timmermans
2 Answers
What are the popular solutions/algorithms for sorting data sets this big?
Since you've already concluded that the data is too large to sort/manipulate in the memory you have available, the popular solution is a database, which builds disk-based structures for managing and sorting more data than can fit in memory.
You can either build your own disk-based scheme or you can grab one that is already fully developed, optimized and maintained (e.g. a popular database). The "popular" solution that you asked about would be to use a database for managing/sorting large data sets. That's exactly what they're built for.
Database
You could set up a table that was indexed by your sort key, insert all the records into the database, then create a cursor sorted by your key and iterate the cursor, writing the now sorted records to your new file one at a time. Then, delete the database when done.
Chunked Memory Sort, Manual Merge
Alternatively, you could do a chunked sort: break the data into smaller pieces that fit in memory, sort each piece, and write each sorted block to disk. Then merge all the blocks: read the next record from each block into memory, find the lowest one among all the blocks, write it to your final output file, read the next record from that block, and repeat. Using this scheme, the merge only ever needs N records in memory at a time, where N is the number of sorted chunks you have (likely far fewer records than the original chunked block sort held).
As juvian mentioned, here's an overview of how an "external sort" like this works: https://en.wikipedia.org/wiki/External_sorting.
One key aspect of the chunked memory sort is determining how big to make the chunks. There are a number of strategies. The simplest may be to just decide how many records you can reliably fit and sort in memory based on a few simple tests, or even just a guess that you're sure is safe (picking a smaller number to process at a time just means you will split the data across more files). Then, read that many records into memory, sort them, and write them out to a known filename. Repeat that process until all the records have been read and now reside in temp files with known filenames on disk.
Then, open each file, read the first record from each one, find the lowest of the records you read in, write it out to your final file, read the next record from that file, and repeat the process. When you get to the end of a file, just remove it from the list of data you're comparing, since it's now done. When there is no more data, you're done.
Sort Keys only in Memory
If all the sort keys themselves would fit in memory, but not the associated data, then you could make and sort your own index. There are many different ways to do that, but here's one scheme.
Read through the entire original data, capturing two things in memory for every record: the sort key and the file offset in the original file where that record is stored. Then, once you have all the sort keys in memory, sort them. Then, iterate through the sorted keys one by one, seeking to the right spot in the file, reading that record, writing it to the output file, and advancing to the next key, repeating until the data for every key has been written in order.
BTree Key Sort
If all the sort keys won't fit in memory, then you can get a disk-based BTree library that will let you sort things larger than can be in memory. You'd use the same scheme as above, but you'd be putting the sort key and file offset into a BTree.
Of course, it's only one step further to put the actual data itself from the file into the BTree and then you have a database.
I would read the entire file row by row and output each line into a temporary folder, grouping lines into files by a reasonable time interval (whether the interval should be a year, a day, an hour, etc. is for you to decide based on your data). The temporary folder would then contain one file per interval (for example, with a per-day split: 2018-05-20.tmp, 2018-05-21.tmp, 2018-05-22.tmp, etc.). Now we can read the files in order, sort each in memory, and output into the target sorted file.