Command for spark-submit #3
See https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying. It is a required dependency for Kafka and Spark Structured Streaming to work together.
I see! Can you explain what the colon symbol (:) means?
I guess that is just for separation. It would mean that we want the artifact spark-sql-kafka-0-10_2.12 from the group org.apache.spark at version 3.1.2; the colons separate the Maven group ID, artifact ID, and version.
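To make the separator concrete, here is a minimal Python sketch that splits the coordinate from the spark-submit command into its three colon-separated parts (the comments about what each part means reflect Maven/Ivy conventions, not anything specific to this project):

```python
# A Maven/Ivy coordinate, as passed to spark-submit --packages,
# has the form groupId:artifactId:version.
coordinate = "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2"

group_id, artifact_id, version = coordinate.split(":")

print(group_id)     # the organization publishing the artifact
print(artifact_id)  # the library; "_2.12" marks the Scala build it was compiled for
print(version)      # should match the Spark version you run against
```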
Thanks a lot! I actually have another, unrelated question. I have been struggling to figure out how Eventsim and Kafka work together to create the four topics: listen_events, page_view_events, auth_events, status_change_events. Is that hard-coded and does it only work for the Million Song Dataset? I also saw that each event has different columns, so you assign a different Spark schema to each.
Right. Eventsim is currently coded to generate events for a music streaming website. You can change that, but it is written in Scala, so if you have experience with that, why not. The topics are also currently hardcoded. You can check the codebase here.
I see. How did you check the columns and data types of each Kafka event before defining the schemas?
From the Confluent Control Center UI. It should be available on port 9021 of your VM. You will have to forward that port to your local machine and open it in the browser.
I didn't know you could check those from the Control Center. Thanks for being helpful.
No worries. If you check the YouTube video I put out a few days back, you should get an idea of how to navigate the project. You'll find the link in the README.
Hi. Do you mind answering another question of mine? I hope this is not bothering you. When I set the environment variable
Hey, no worries, happy to answer your questions. I am not sure why this would happen. Eventsim is running on the same VM as Kafka, so it should pick it up directly. Eventsim doesn't refer to that variable.
Your Kafka broker will be available at that port for other applications to connect to.
My mistake. It was opened automatically. (Sorry, I deleted the comment after I realized that.) Now it works. But let me try again after setting the env variable.
Yeah, it doesn't work if I set the variable. This is what I see in the terminal:
By the way, I just saw this in the Kafka docker-compose YAML file:
So the KAFKA_ADDRESS env variable does affect things.
Yes, it does. If you want to connect from an external VM, like the Spark cluster in our case, you need to use the external IP address instead of localhost. Hence I added the KAFKA_ADDRESS variable in the docker-compose.yml.
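The effect of the variable can be sketched in Python. This mirrors a compose-style default substitution such as ${KAFKA_ADDRESS:-localhost} (the exact entry in the project's docker-compose.yml is an assumption here, as is the example IP):

```python
import os

def bootstrap_server(default_host="localhost", port=9092):
    # Fall back to localhost when KAFKA_ADDRESS is unset,
    # the way ${KAFKA_ADDRESS:-localhost} behaves in docker-compose.
    host = os.environ.get("KAFKA_ADDRESS", default_host)
    return f"{host}:{port}"
```

Unset, everything resolves to localhost:9092 and only same-VM clients can connect; set to the VM's external IP, the advertised address becomes reachable from other machines.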
I just noticed this in the Terraform config:
I didn't follow everything in your project, so I missed this part. Is this the reason it doesn't work? But even without this step, port 9092 does open up in VS Code when I SSH into the Kafka VM. Or are these totally different things?
Hey, when I set the port rules, it worked! It turns out this step is important.
So on VS Code, you are forwarding the port. That is not the same as opening the port. You are able to forward the port to your local computer because you already authenticated when you SSHed in. Opening a port, on the other hand, allows connections from some other VM in the network. By default, no ports accept connections from other VMs inside or outside the network; you need to specify which ports you would like to open for connections.
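One way to see the difference: an SSH-forwarded port only answers on your own machine, while a firewall-opened port answers a plain TCP connection from another VM. A small probe, as a sketch:

```python
import socket

def port_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused or timed out: nothing is listening, or a firewall blocks it.
        return False
```

Run this from the Spark VM against the Kafka VM's IP and port 9092: it stays False until the firewall rule opens the port, even if the same port is already forwarded to your laptop over SSH.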
Ah, I see! This explanation is helpful.
I have another question. I see that you set up a Dataproc cluster for Spark streaming. Do you think it makes sense to do everything on a single VM and serve it locally? Everything would then involve "localhost", the master would be "local[*]", etc.
You can. However, in the real world Spark would always be running on some kind of a cluster, so making things run on a cluster is the wiser choice if you plan to showcase the infrastructure choices in an interview or somewhere else.
I see. Yeah, you made a great point.
By the way, I am not sure whether this is a mistake on my part, but I think I might have found some potential bugs:
The relevant change I made is the removal of the KAFKA_ADDRESS environment variable.
Otherwise, I couldn't make it work. (Ignore my reply from a few days ago; I too thought it worked when I set the env variable to the external IP of the VM, but it actually did not. My bad.)
I think there is a mistake in the Spark version. In the Spark README, version 3.0.3 is installed. When I changed it to:
it finally worked.
Hey, apologies for the late reply. There's a lot going on at this point and I kind of missed this comment.
No worries. But I wonder if it's fine for us to talk somewhere other than here? I also miss comments here sometimes. Perhaps Slack is good, but I am fine with anything. I now have questions regarding the dbt + Airflow part. I am very new to SQL, so I really need your help to understand the logic behind the SQL queries you wrote.
Sure, feel free to reach out over Slack. I assume you must be in the DataTalks Slack workspace. For 1 and 2, both are techniques to handle dimensions where the value is null or unknown. It's kind of hard to explain over chat, but the idea is that there should not be any null dimension keys in your fact table. A null artist represents nothing and also messes up aggregates and visualizations. Instead, we add a record to the artists dimension so that all the null artist ids are represented by that record. There's a short explanation of this on the Kimball Group blog.
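The idea can be sketched outside SQL. Assume a toy artists dimension and a fact list of listen events; null artist keys in the fact are coalesced to a dedicated "Unknown" dimension row, so joins and aggregates never drop rows (the artist names and the -1 surrogate key are illustrative, not the project's actual schema):

```python
UNKNOWN_KEY = -1  # surrogate key reserved for the "Unknown" dimension row (illustrative)

# Dimension table: note the explicit Unknown member alongside real artists.
artists = {UNKNOWN_KEY: "Unknown Artist", 1: "Coldplay", 2: "Adele"}

# Fact rows as they arrive; some have no artist id.
listen_events = [
    {"artist_id": 1, "duration": 200},
    {"artist_id": None, "duration": 180},
    {"artist_id": 2, "duration": 240},
]

def coalesce_key(artist_id):
    """Map a missing artist id to the Unknown member (like SQL's COALESCE)."""
    return artist_id if artist_id is not None else UNKNOWN_KEY

# Aggregate listen time per artist name; no row is lost to a null key.
totals = {}
for event in listen_events:
    name = artists[coalesce_key(event["artist_id"])]
    totals[name] = totals.get(name, 0) + event["duration"]
```

Without the Unknown member, the row with a null artist_id would either fail the dimension lookup or vanish from the report; with it, the 180 seconds show up under "Unknown Artist".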
I have a question regarding the command for spark-submit:
What is the meaning of the packages
org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
?