

HOW TO READ ZIP FILES IN SPARK
I have read about Spark's support for gzip input files here, and I wonder whether the same support exists for other kinds of compressed files, such as .zip. So far I have tried computing a file compressed inside a zip archive, but Spark seems unable to read its contents successfully. I have also taken a look at Hadoop's newAPIHadoopFile and newAPIHadoopRDD, but so far I have not been able to get anything working.

In addition, Spark supports creating a partition for every file under a specified folder, like in the example below:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf SpkCnf = new SparkConf().setAppName("SparkApp");
JavaSparkContext Ctx = new JavaSparkContext(SpkCnf);
JavaRDD<String> FirstRDD = Ctx.textFile("C:\\input").cache();

Where C:\input points to a directory with multiple files. If computing zipped files were possible, would it also be possible to pack every file into a single compressed archive and follow the same pattern of one partition per file?

All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

The list of supported compression formats comes from Hadoop, and it can basically be checked by finding all classes extending CompressionCodec (docs):

name    | ext      | codec class
bzip2   | .bz2     | org.apache.hadoop.io.compress.BZip2Codec
deflate | .deflate | org.apache.hadoop.io.compress.DeflateCodec
gzip    | .gz      | org.apache.hadoop.io.compress.GzipCodec
lz4     | .lz4     | org.apache.hadoop.io.compress.Lz4Codec
snappy  | .snappy  | org.apache.hadoop.io.compress.SnappyCodec

Source: List the available hadoop codecs.

So the above formats, and many more, can be read simply by calling sc.textFile(path). Unfortunately, zip is not on the supported list by default. I have found a great article, Hadoop: Processing ZIP files in Map/Reduce, and some answers (example) explaining how to use an imported ZipFileInputFormat together with the sc.newAPIHadoopFile API. Alternatively, we can iterate through all of the zip files using Spark's binaryFiles primitive and, for each file, use Python's zipfile module to get at the contents (for example, the tsv inside each archive). Both routes are sketched below.
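A minimal PySpark sketch of the first route, on the assumption that the ZipFileInputFormat from the article has been packaged into a jar on Spark's classpath and that it emits (Text, BytesWritable) pairs; the package name com.cotdp.hadoop and the path /data/archive.zip are illustrative guesses, not something given in the answer:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ZipViaInputFormat")
sc = SparkContext(conf=conf)

# Each record is (name of the entry inside the zip, its contents as bytes),
# assuming ZipFileInputFormat emits Text keys and BytesWritable values.
entries = sc.newAPIHadoopFile(
    "/data/archive.zip",                    # assumed path
    "com.cotdp.hadoop.ZipFileInputFormat",  # assumed class name
    "org.apache.hadoop.io.Text",
    "org.apache.hadoop.io.BytesWritable")

# Turn every entry into text lines, much as textFile would for a plain file.
lines = entries.flatMap(lambda kv: bytes(kv[1]).decode("utf-8").splitlines())
print(lines.take(5))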
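The binaryFiles route needs no extra jars: binaryFiles yields one (path, contents) pair per file, and Python's zipfile can open each archive in memory. A sketch, assuming the zip archives sit under the C:\input directory from the question and contain UTF-8 text such as tsv files:

import io
import zipfile
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ZipViaBinaryFiles")
sc = SparkContext(conf=conf)

def read_zip(path_and_bytes):
    """Yield every text line of every member of one zip archive."""
    path, content = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for name in archive.namelist():
            for line in archive.read(name).decode("utf-8").splitlines():
                yield line

# One (path, contents) pair per zip file under the directory (assumed glob).
zips = sc.binaryFiles("C:/input/*.zip")
lines = zips.flatMap(read_zip)
print(lines.take(5))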
