feat: HyperANF implementation #841
- HyperANF - bump versions in CI matrix
@james-willis I will add Python bindings (connect/classic) and docs once the API design is pre-approved.
I was thinking a lot... It looks like this one is still necessary.
The final goals are HyperBALL, approximate closeness centrality, etc. All of these are just simple transformations on top of HyperANF. Should I add these in the same PR or in follow-up PRs?
val hop0func = udf(HyperANF.hll(lgNomEntries))
var state = edges
  .groupBy(col(GraphFrame.SRC).alias(GraphFrame.ID))
  .agg(hll_sketch_agg(GraphFrame.DST, lgNomEntries).alias("hop_1"))
Is it important to always pass the same type into the HLL sketch functions? On line 157 we convert to string, so maybe we should do that here as well.
Related to this, how does this function deal with cycles? Do we need a test for the case where there is a cycle back to the hop 0 node?
Cycles are not a problem. We are limited by hHops. When the user does union + estimate, vertices revisited through a cycle are deduplicated, so all the cycles will be gone.
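To illustrate the point about cycles, here is a minimal pure-Python sketch. It is not the code from this PR: exact Python sets stand in for HLL sketches, set union stands in for `hll_union_agg`, and the helper names `hop_sets` and `reachable_count` are hypothetical. It assumes the per-hop semantics visible in the diff, where the hop n value of a vertex is the union of the hop n-1 values of its out-neighbors.

```python
from collections import defaultdict


def hop_sets(edges, n_hops):
    """Return {vertex: [hop_0, hop_1, ..., hop_n]} using exact sets.

    Assumption for illustration: a Python set replaces an HLL sketch,
    and set union replaces the sketch union.
    """
    out_neighbors = defaultdict(set)
    vertices = set()
    for src, dst in edges:
        out_neighbors[src].add(dst)
        vertices.update((src, dst))

    # Hop 0 of a vertex is the vertex itself.
    hops = {v: [{v}] for v in vertices}
    for hop in range(1, n_hops + 1):
        new_level = {}
        for v in vertices:
            merged = set()
            for u in out_neighbors[v]:
                merged |= hops[u][hop - 1]
            new_level[v] = merged
        for v in vertices:
            hops[v].append(new_level[v])
    return hops


def reachable_count(hops_for_vertex):
    """Union all hop levels of one vertex, then count ("union + estimate")."""
    return len(set().union(*hops_for_vertex))


# A directed triangle: every walk eventually returns to its start vertex.
triangle = [(1, 2), (2, 3), (3, 1)]
hops = hop_sets(triangle, n_hops=5)
print(hops[1][3])                 # {1}: the cycle brings us back to the hop 0 node
print(reachable_count(hops[1]))   # 3
```

Running more hops than the cycle length only revisits the same vertices, and the final union collapses them, so the count stays at 3.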
publishArtifact := false

lazy val commonSetting = Seq(
  libraryDependencies ++= Seq(
Please add org.apache.datasketches.hll.HllSketch here
Why? I mean it is a part of the Spark Runtime.
    col(GraphFrame.DST) === col(GraphFrame.ID),
    "left")
  .groupBy(col(GraphFrame.SRC).alias(GraphFrame.ID))
  .agg(hll_union_agg(s"hop_${hop - 1}").alias(s"hop_${hop}"))
hll_union_agg(s"hop_${hop - 1}") will return null if hop_n is null. We should probably handle this with a coalesce to an empty sketch?
To be honest, I don't see how it can be null. It can be empty, and that is handled correctly. But how can it be null (unless some vertex ID is null)? P.S. Null vertex IDs are considered an invalid graph: at the moment most GraphFrames algorithms will simply fail on null IDs, and handling them is very expensive (full scan).
Let me check this.
You are right: there can be nulls. On the other hand, this code:
from pyspark.sql import functions as F

(
    spark
    .createDataFrame([(1, None), (1, None), (1, None)], schema="k: int, v: binary")
    .toDF("k", "v")
    .groupBy("k")
    .agg(F.hll_union_agg("v").alias("v"))
    .select(F.hll_sketch_estimate("v").alias("v"))
    .show()
)
returns
+---+
|  v|
+---+
|  0|
+---+
so it is not a problem actually.
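For readers wondering where the nulls come from, the following pure-Python sketch models the left join from the diff on a graph with sink vertices. It is only a model of the behavior reported above, not Spark's implementation: sets stand in for sketches, `None` stands in for a null sketch, and `union_agg` is a hypothetical helper that skips `None` entries the way the experiment above suggests `hll_union_agg` does.

```python
def union_agg(sketches):
    """Union a group of set-based 'sketches', skipping None entries."""
    merged = set()
    for sketch in sketches:
        if sketch is not None:
            merged |= sketch
    return merged


# Vertices 2 and 3 are sinks: they have no outgoing edges.
edges = [(1, 2), (1, 3)]

# State keyed by src only, as a groupBy on the source column would yield.
hop_1 = {1: {2, 3}}

# Left join edges to state on dst == id: sinks have no row, so the value is None.
joined = [(src, hop_1.get(dst)) for src, dst in edges]

# Group by src and union the joined values.
groups = {}
for src, sketch in joined:
    groups.setdefault(src, []).append(sketch)
hop_2 = {src: union_agg(group) for src, group in groups.items()}

print(joined)           # [(1, None), (1, None)]
print(len(hop_2[1]))    # 0
```

Under this model, a group made up entirely of nulls aggregates to an empty result with a count of 0, which matches the output shown above.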
I added a test for that case.
What changes were proposed in this pull request?
Why are the changes needed?
Close #840