Forums

This forum is for general programming and development related discussions.

Discuss cloud computing technologies, learning resources, etc.

Discuss big data frameworks/technologies such as Hadoop, Spark, etc.

For any Kontext website related questions, please post here, including feature suggestions, bug reports, and other feedback. Visit the Help Centre to learn how to use the Kontext platform efficiently.

#1548 Re: Spark 3.0.1: Connect to HBase 2.4.1 (20 days ago)

Please contact us via the Contact us page, and we will try to arrange a Teams session for you.

#1547 Re: Run Multiple Python Scripts PySpark Application with yarn-cluster Mode (21 days ago)

I understand your problem very well now. I think you are confused by: 1) HDFS paths vs. local paths; 2) accessing/writing data using PySpark vs. pure Python.

There are two problems with your code:

  1. As I explained before, open is a pure Python function: it can only read local files on the node and it cannot read an HDFS path. That is why I suggested passing the file in the submit command, so that it is shipped to the driver and executor containers and you can then use open to read it. However, my question would then be: what is the purpose of using Spark if all your input and output are done locally?
  2. When you write the file, it is also written on the node where your Spark master application resides.

The node in the above two points can be picked at random, so even if you write the file successfully you won't be able to retrieve it easily.

Thus to resolve your problem:

1) Read from HDFS using spark.read, not open. Alternatively, you may try an HDFS Python library (without Spark); however, I won't recommend it as you may hit some problems (permission setup) and I have never used this library:

import pydoop.hdfs as hdfs

with hdfs.open('/user/myuser/filename') as f:
    for line in f:
        do_something(line)

Another Python library you can potentially use is hdfs.
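
For reference, a minimal sketch of how that hdfs package is typically used (the namenode host, WebHDFS port and user below are placeholders, not values from this thread):

from hdfs import InsecureClient  # assumes the 'hdfs' package is installed and WebHDFS is enabled

client = InsecureClient('http://namenode-host:9870', user='myuser')  # placeholder host/port/user

# read a text file from HDFS
with client.read('/user/myuser/filename', encoding='utf-8') as reader:
    content = reader.read()

# write a string back to HDFS
client.write('/user/myuser/output.ttl', data=content, encoding='utf-8', overwrite=True)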

2) When you write the data, there are several possible ways:

A) You can only write from Spark in its supported formats, and the one you used is not supported. I am not sure about the format you mentioned; if you can use CSV, JSON, etc. to save the file into HDFS using the DataFrameWriter (df.write) APIs, you can then use HDFS commands or pure Python HDFS client libraries to copy the file to the local server.
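
As a rough sketch of this approach (the HDFS and local paths below are placeholders):

# write the DataFrame to HDFS in a supported format (CSV here)
df.write.mode('overwrite').csv('hdfs:///user/myuser/output_csv', header=True)

Afterwards you can copy the result to the local file system with an HDFS command such as hdfs dfs -get /user/myuser/output_csv /tmp/output_csv.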

B) Write the Spark DataFrame into a database using JDBC and then retrieve the data using Python. The retrieval script needs to run in a local Python environment instead of using PySpark.
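
The write side could look roughly like this (the JDBC URL, table name and credentials are placeholders, and the matching JDBC driver jar needs to be available to Spark):

# write the DataFrame to a relational table via JDBC
df.write \
    .format('jdbc') \
    .option('url', 'jdbc:postgresql://dbhost:5432/mydb') \
    .option('dbtable', 'my_schema.my_table') \
    .option('user', 'myuser') \
    .option('password', 'mypassword') \
    .mode('append') \
    .save()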

C) It's possible to customize the Spark writer from Spark 3.0 onwards, but if you are not familiar with Spark APIs, I won't recommend this.

#1546 Re: Run Multiple Python Scripts PySpark Application with yarn-cluster Mode (21 days ago)

Hi Raymond!

I can share the overall structure of my code.

1) Importing required libraries

2) creating spark session

spark = SparkSession.builder.appName('God').getOrCreate()

3) loading data from database to data frame

Using spark.read --> successfully able to pull data

4) cleaning the data and doing transformation

It is also successful. You can mark my word on this.

5) After transforming the data (internal use), trying to write it to a file and store the output at the desired location.

Failing here

with open('path/to/hdfs/filename') as file:  # also tried with a local path
    file.write(g.serialize(format='turtle'))  # please ignore what is written inside file.write

It shows "No such file or directory", but in client mode it is able to create the file at the provided local path. I feel open() is not able to look outside the current pwd.

On other pages everyone is using df.write.format, but I want my output written to a turtle file, which is neither text, CSV, Parquet, etc.

Sorry, I can't share the code, but I hope I have explained as much as possible. If there is still any doubt/question, please reply here.


#1545 Re: Run Multiple Python Scripts PySpark Application with yarn-cluster Mode (21 days ago)

Hi Venu,

The code example you provided to me is a local file write, which has nothing to do with Spark:

with open("/user/user_name/myfile.ttl",mode='w+') as file:# It's a turtle file.

    file.write("This is truth")

The above lines will run in the driver application container in the Spark cluster.

That is why I made the comments before.

To illustrate more, can you share the complete script if that is okay?

#1544 Re: Run Multiple Python Scripts PySpark Application with yarn-cluster Mode (21 days ago)

Hi Raymond

Thanks for the reply!

I have some doubts. (.txt) is just an example; actually I want to store a .ttl type of file (turtle file), i.e. RDF (Resource Description Framework) triples. I don't want to read the file; I have already read the data using spark.read and stored it in a data frame. After transforming the data I just want to write the output of the program to a file in spark cluster mode.

Note: I have tried your suggestion but it still gives the same error.

Can you please provide some more detailed explanation/solution?

#1543 Re: Run Multiple Python Scripts PySpark Application with yarn-cluster Mode (24 days ago)

Hi Venu,

When you run the job locally, your Python application can reference any local file path that your master can reach.

When you submit the job to run in a cluster and the master container is also in the cluster, you can only reference local file paths on the server where the master container is spun up.

So to fix your issue, you can upload your file into HDFS and use spark.read APIs to read the data; alternatively, you can pass the file when you submit the application as I did in this article:

--py-files file.txt

In your code, you can reference it with path file.txt.

I would recommend uploading the file into HDFS first. 
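
For example, after uploading the file with hdfs dfs -put, the read side would look roughly like this (the HDFS path is a placeholder):

df = spark.read.text('hdfs:///user/user_name/file.txt')  # each line becomes a row in column 'value'
df.show(5)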

#1542 Re: Run Multiple Python Scripts PySpark Application with yarn-cluster Mode (24 days ago)

Hi,

I'm running this script in spark cluster mode on a server.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('God').getOrCreate()

with open("/user/user_name/myfile.ttl",mode='w+') as file:# It's a turtle file.

    file.write("This is truth")

Trying to run this script in spark cluster mode

spark-submit --master yarn --deploy-mode cluster h1.py

Getting error: No such file or directory. I have provided the correct path and checked several times.

I have checked that the directory exists and have also tried different HDFS paths.

The code works perfectly fine in client mode. It seems that the executor node is not able to find the mentioned path. Can we use with open to write files in cluster mode? If not, then how do we write files in cluster mode?

Kindly help me on this.



#1541 Re: Spark 3.0.1: Connect to HBase 2.4.1 (1 month ago)

Yes, they are all in the current directory. Can we connect if possible?

#1540 Re: Spark 3.0.1: Connect to HBase 2.4.1 (1 month ago)

Are all those jars included in the current directory where you initiated the spark-shell?

You can manually put them into the jars directory of your Spark installation.