Skip to content

feat: HyperANF implementation#841

Open
SemyonSinchenko wants to merge 7 commits into
graphframes:mainfrom
SemyonSinchenko:840-hyperANF
Open

feat: HyperANF implementation#841
SemyonSinchenko wants to merge 7 commits into
graphframes:mainfrom
SemyonSinchenko:840-hyperANF

Conversation

@SemyonSinchenko

Copy link
Copy Markdown
Collaborator

What changes were proposed in this pull request?

  • HyperANF
  • bump versions in CI matrix

Why are the changes needed?

Close #840

@SemyonSinchenko SemyonSinchenko changed the title [WIP] feat: HyperANF implementation feat: HyperANF implementation Jun 2, 2026
@SemyonSinchenko

Copy link
Copy Markdown
Collaborator Author

@james-willis I will add python bindings (connect/classic) and docs after pre-approve of the API design.

I was thinking a lot... It looks like this one is still neccessary.
@SemyonSinchenko

Copy link
Copy Markdown
Collaborator Author

The final goals are HyperBALL, approximate closeness centrality, etc. All of these are just simple transformations on top of the HyoerANF. Should I add these in the same PRs or in follow-up PRs?

@SemyonSinchenko SemyonSinchenko self-assigned this Jun 9, 2026
val hop0func = udf(HyperANF.hll(lgNomEntries))
var state = edges
.groupBy(col(GraphFrame.SRC).alias(GraphFrame.ID))
.agg(hll_sketch_agg(GraphFrame.DST, lgNomEntries).alias("hop_1"))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it important to make sure we always pass the same type into the hll sketch functions? on 157 we convert to string so maybe we should do that here as well

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to this, how does this function deal with cycles? do we need a test for this case where there is a cycle to the hop 0 node?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cycles are not a problem. We are limited by hHops. When user do union + estimate all the cycles will be gone.

Comment thread core/src/main/scala/org/graphframes/mixins.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/HyperANF.scala Outdated
Comment thread build.sbt
publishArtifact := false

lazy val commonSetting = Seq(
libraryDependencies ++= Seq(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add org.apache.datasketches.hll.HllSketch here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? I mean it is a part of the Spark Runtime.

Comment thread core/src/main/scala/org/graphframes/lib/HyperANF.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/HyperANF.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/HyperANF.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/HyperANF.scala Outdated
col(GraphFrame.DST) === col(GraphFrame.ID),
"left")
.groupBy(col(GraphFrame.SRC).alias(GraphFrame.ID))
.agg(hll_union_agg(s"hop_${hop - 1}").alias(s"hop_${hop}"))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hll_union_agg(s"hop_${hop - 1}") will return null if hop_n is null. we should probably handle this with a coalesce to some null sketch?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tbh I don't see how it can be null. It can be empty and this is handled correctly. But how can it be null (except some vertex-id is null?) P.S. null vertex IDs are considered as an invalid graph: at the moment most of GF algorithms will just fail on null-ids and handling it is very expensive (full-scan).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check this.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right: there can be nulls. From the other side.

This code:

(
    spark
    .createDataFrame([(1, None), (1, None), (1, None)], schema="k: int, v: binary")
    .toDF("k", "v")
    .groupBy("k")
    .agg(F.hll_union_agg("v").alias("v"))
    .select(F.hll_sketch_estimate("v").alias("v"))
    .show()
)

returns

+---+
|  v|
+---+
|  0|
+---+

so it is not a problem actually.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test for that case.

Comment thread core/src/main/scala/org/graphframes/lib/HyperANF.scala Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: hyperANF

2 participants