Step 1: Compress the CSV file

Although this approach works for relatively small tables, the performance degrades as the data grows. As a first improvement, we can try to compress the exported CSV file using gzip. Compression will speed up the process, since the amount of data uploaded to the S3 bucket will be reduced.

We can either export the CSV file, store it locally on the sync service and then compress it, or we can perform the compression during the export on the DB server. There are pros and cons for each approach. In the first approach, the compression is performed on the client side, which means that uncompressed data is transferred from the DB server to the sync service. In the second approach, the compression is performed on the DB server side, which means that we transfer compressed data between the DB and the sync service; but since compression is in general a CPU-intensive task, it puts some extra load on the DB server, and for large amounts of data this could slow down the export process. Note that the only difference between these approaches is the \COPY vs COPY command. Also, the second approach requires the DB user to have elevated privileges, as we discussed in the first section.

We will go with the first approach in order to avoid overloading the DB server. In order to compress the CSV file, we can take advantage of the TO PROGRAM option of the \COPY command, which pipes the exported data through a program such as gzip (check the docs for more information). Our export script, with compression, pipes the \COPY output through gzip.

Step 2: Split the CSV file

The AWS Redshift advisor prompts us to split the CSV file into multiple (equal-size) files, so that the COPY command can load the data in parallel into the corresponding table. The number of these files should be a multiple of the number of slices in our cluster. To split the file into smaller chunks, we can use the unix split program. split has many options for how we want to split our file; for example, we can split a file into smaller files based on the number of lines, bytes, etc. Since we know how many chunks we want (let's say 8), we can call split with the --number=l/8 option to get exactly that many chunks without splitting any line in half. But \COPY TO streams the standard output to the PROGRAM, so we don't have the entire file yet, and the entire file is required for split to work with the --number option. What if we save the stdout to a CSV file first, then split it and compress the chunks?
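A minimal sketch of this save-then-split-then-compress approach (the table and file names are hypothetical, and a locally generated sample CSV stands in for the actual \COPY export so that the snippet is self-contained):

```shell
# In the real export script, this step would be the \COPY itself, e.g.:
#   psql "$DATABASE_URL" -c "\copy my_table TO 'export.csv' WITH (FORMAT csv)"
# Here we generate a sample CSV so the sketch runs standalone.
seq 1 1000 | awk '{ print $1 ",row" $1 }' > export.csv

# Split into exactly 8 chunks without breaking any line in half (GNU split).
split --number=l/8 --additional-suffix=.csv export.csv export_part_

# Compress each chunk independently, so each can be uploaded and loaded on its own.
gzip export_part_*.csv
```

The resulting `export_part_*.csv.gz` files can then be uploaded to S3 and loaded in parallel with Redshift's COPY command using its GZIP option.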