scala - Custom input reader in Spark
I'm new to Spark and would like to load page records from a Wikipedia dump into an RDD.
I tried to use a record reader provided in Hadoop streaming, but I could not figure out how to use it. Could anyone help me make the following code create a nice RDD of page records?
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.streaming.StreamXmlRecordReader
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext

    object WikiTest {
      def main(args: Array[String]) {
        // configuration
        val sparkConf = new SparkConf()
          .setMaster("local[4]")
          .setAppName("WikiDumpTest")

        val jobConf = new JobConf()
        jobConf.set("input", "enwikisource-20140906-pages-articles-multistream.xml")
        jobConf.set("stream.recordreader.class", "org.apache.hadoop.streaming.StreamXmlRecordReader")
        jobConf.set("stream.recordreader.begin", "<page>")
        jobConf.set("stream.recordreader.end", "</page>")

        val sparkContext = new SparkContext(sparkConf)

        // read data
        val wikiData = sparkContext.hadoopRDD(
          jobConf,
          classOf[StreamXmlRecordReader],
          classOf[Text],
          classOf[Text])

        // count lines
        println(wikiData.count)
      }
    }

Spark seems to refuse to use the StreamXmlRecordReader. I get the following error:
[error]  found   : Class[org.apache.hadoop.streaming.StreamXmlRecordReader](classOf[org.apache.hadoop.streaming.StreamXmlRecordReader])
[error]  required: Class[_ <: org.apache.hadoop.mapreduce.InputFormat[?,?]]
[error]  classOf[StreamXmlRecordReader]
If I ignore the Eclipse warning and launch the program anyway, I hit a java.lang.ClassNotFoundException.
You should use classOf[org.apache.hadoop.streaming.StreamInputFormat] instead of classOf[StreamXmlRecordReader]. hadoopRDD expects an InputFormat class there, not a record reader. The java.lang.ClassNotFoundException occurs because you are trying to run your class WikiTest, but it does not exist since it could not be compiled.
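A minimal sketch of that change, for illustration only. It assumes the hadoop-streaming jar is on the classpath and that the dump file sits in the working directory. One further assumption that is not part of the question or the answer: the input path is registered with FileInputFormat.setInputPaths instead of the "input" key, because file-based input formats read their paths from the job configuration that call fills in.

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.streaming.StreamInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object WikiTest {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("WikiDumpTest")
    val sparkContext = new SparkContext(sparkConf)

    val jobConf = new JobConf()
    // Assumption: set the input path via FileInputFormat rather than a custom "input" key.
    FileInputFormat.setInputPaths(jobConf, "enwikisource-20140906-pages-articles-multistream.xml")
    // StreamInputFormat instantiates the record reader named by this setting.
    jobConf.set("stream.recordreader.class", "org.apache.hadoop.streaming.StreamXmlRecordReader")
    jobConf.set("stream.recordreader.begin", "<page>")
    jobConf.set("stream.recordreader.end", "</page>")

    // Pass the InputFormat class here; the record reader is selected through jobConf.
    val wikiData = sparkContext.hadoopRDD(
      jobConf,
      classOf[StreamInputFormat],
      classOf[Text],
      classOf[Text])

    // Each record should correspond to one <page>...</page> block.
    println(wikiData.count())
    sparkContext.stop()
  }
}
```

The keys and values come back as Hadoop Text objects, so convert them with toString before collecting or caching them.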