Region Error when trying to access Google Cloud Bigtable with Spark from a Jupyter Notebook
I am trying to run parallel access to Google Cloud Bigtable from a Jupyter Notebook running a PySpark kernel. I am working from the example at http://ec2-54-66-129-240.ap-southeast-2.compute.amazonaws.com/httrack/docs/cloud.google.com/dataproc/examples/cloud-bigtable-example.html, substituting my own project/zone/cluster/table names. Authentication is done through service account credentials broadcast in the Spark context.
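(The question does not show how the credentials are broadcast; the sketch below is only an illustration of one way it might be done. The key-file path and variable names are hypothetical.)

    import json

    # Hypothetical: load a service account key file on the driver...
    with open("/path/to/service-account-key.json") as f:
        creds = json.load(f)

    # ...and broadcast it so executor-side code can reach it via creds_bc.value.
    creds_bc = sc.broadcast(creds)

The Spark snippet in question: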
    import json

    jconf = {"hbase.client.connection.impl": "com.google.cloud.bigtable.hbase1_1.BigtableConnection",
             "google.bigtable.project.id": myProject,
             "google.bigtable.zone.name": myZone,
             "google.bigtable.cluster.name": myCluster,
             "hbase.mapreduce.inputtable": myTable}

    keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"

    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        # keyConverter=keyConv,
        # valueConverter=valueConv,
        conf=jconf)

    hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n")).mapValues(json.loads)
    print("Row count: %s" % hbase_rdd.count())
I get the following error:
    Py4JJavaErrorTraceback (most recent call last)
    <ipython-input-30-55b05ded0d2b> in <module>()
         21     #keyConverter=keyConv,
         22     #valueConverter=valueConv,
    ---> 23     conf=jconf)
         24
         25 hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n")).mapValues(json.loads)

    /usr/lib/spark/python/pyspark/context.pyc in newAPIHadoopRDD(self, inputFormatClass, keyClass, valueClass, keyConverter, valueConverter, conf, batchSize)

    /usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in __call__(self, *args)

    /usr/lib/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)

    /usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)

    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
From the terminal where the Jupyter notebook is running, I can reach the Bigtable instance on GCloud without any problem. Also, the google.cloud.bigtable and google.cloud.happybase connectors work fine in the same Jupyter notebook (but they provide no upfront parallelization of the calls to Bigtable).
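For contrast, the kind of direct, driver-only access that does work is sketched below. The exact calls are my assumption based on the google-cloud-bigtable 0.26.0 client API, and myInstance is a hypothetical placeholder (the question identifies its target by zone/cluster instead):

    # Sketch of the direct (non-Spark) access the question reports as working.
    from google.cloud import bigtable

    client = bigtable.Client(project=myProject)
    instance = client.instance(myInstance)   # hypothetical instance id
    table = instance.table(myTable)
    row = table.read_row(b"some-row-key")    # runs on the driver only, no Spark parallelism
    print(row)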
Any idea what I might be doing wrong here?
FYI, I am using Spark 2.0.2, Hadoop 2.7.3, Python 2.7.12, google-cloud-bigtable 0.26.0, and com.google.cloud.bigtable:bigtable-hbase-1.1:0.2.2 on a Google Dataproc cluster.
Many thanks,
George
EDIT:
After making the edits suggested by Igor Bernstein, I get a new error:
    Py4JJavaErrorTraceback (most recent call last)
    <ipython-input-5-4f0d8b1fb126> in <module>()
         23     #keyConverter=keyConv,
         24     #valueConverter=valueConv,
    ---> 25     conf=jconf)
         26
         27 hbase_rdd = hbase_rdd.flatMapValues(lambda v: v.split("\n")).mapValues(json.loads)

    /usr/lib/spark/python/pyspark/context.py in newAPIHadoopRDD(self, inputFormatClass, keyClass, valueClass, keyConverter, valueConverter, conf, batchSize)

    /usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in __call__(self, *args)

    /usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)

    /usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)

    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
Answer (Igor Bernstein):

What version of bigtable-hbase are you using? Can you try the latest version, bigtable-hbase-1.x-hadoop:1.0.0-pre3? Also, please update your configuration as follows (a sketch of the resulting jconf is shown after the list):

- "hbase.client.connection.impl": "com.google.cloud.bigtable.hbase1_x.BigtableConnection"
- remove "google.bigtable.zone.name"
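Applying those two changes to the jconf from the question gives something like the sketch below. This is an assumption about the intended end state, not code from the thread; newer bigtable-hbase 1.x artifacts identify their target differently (for instance via an instance id), so the remaining keys may need further adjustment.

    # Sketch only: the question's jconf with the two suggested changes applied
    # (new connection impl, "google.bigtable.zone.name" removed). Whether the
    # remaining keys are sufficient for bigtable-hbase-1.x-hadoop:1.0.0-pre3 is
    # an assumption not confirmed in the thread.
    jconf = {"hbase.client.connection.impl": "com.google.cloud.bigtable.hbase1_x.BigtableConnection",
             "google.bigtable.project.id": myProject,
             "google.bigtable.cluster.name": myCluster,
             "hbase.mapreduce.inputtable": myTable}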
- Hi @igor-bernstein, thanks for your answer! I get a new error with those changes; I have added the details to the question itself.
- As for the example code, I don't know where it comes from (did you scroll to the bottom of the web page?)
Source: https://www.codenong.com/46470444/