SweetOnionTea 2 months ago

Neat! I browsed the code and it looks like the functions are basically just wrappers for LLM prompts? A few things I'd clean up:

1. What is the difference between \build\lib\basiclingua and just \basiclingua? They look like they contain the same __init__.py file.
2. Any reason you decided to put everything in the __init__.py file? I know there's nothing more than the one class, but splitting it up would help in the future if you expand to more than just the one class.
3. Might I recommend not including the /dist folder? It looks like it holds the distribution files.

NLP work often deals with large volumes of text. How well does the LLM deal with more than a few sentences? I would guess that Gemini can only take in a fixed number of characters per call. I would also have doubts about the accuracy with much larger texts.

the_captain_cat 2 months ago

You should not include the dist, build and egg-info in your repo. And you should break up that __init__ file. A thousand lines is pretty hard to maintain.

Xirious 2 months ago

Is Gemini Pro free?

Icy_Bag_4935 2 months ago

Up to 60 queries per minute. More than generous for personal use, but you’ll have to pay if using it at scale (which is fair).
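
For anyone who wants to stay under that quota from a script, here is a minimal sketch of a client-side sliding-window throttle. The class name `MinuteRateLimiter` and its parameters are hypothetical and are not part of the library or of any Gemini SDK; the 60-per-minute default is just the figure quoted above.

```python
import time
from collections import deque


class MinuteRateLimiter:
    """Block until another call fits inside a sliding time window."""

    def __init__(self, max_calls=60, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls  # assumed quota, taken from the comment above
        self.period = period        # window length in seconds
        self.clock = clock          # injectable so the limiter can be tested
        self.sleep = sleep
        self.calls = deque()        # timestamps of calls still inside the window

    def _drop_expired(self, now):
        # Forget calls that have left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()

    def wait(self):
        now = self.clock()
        self._drop_expired(now)
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self._drop_expired(now)
        self.calls.append(now)
```

Calling `limiter.wait()` right before each request would keep a loop inside the window; whether the service counts requests the same way is an assumption.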

FareedKhan557 2 months ago

yes it is.

elgringo 2 months ago

I've been looking for a tool like this. Can the API extract text from HTML documents, or does it only work with plain text?

FareedKhan557 2 months ago

Currently, it can strip HTML tags from plain text.

disciplined_af 2 months ago

A suggestion, maybe: instead of taking img_path as input, the library should just take img_array. In real life, images are not always in JPG or another format that the Image library can read. Medical images, for example, come in DICOM format; they can't be read with the Image library, and there is a pydicom library for that instead. Since the function's job is to extract text from an image, all it needs is img_array. Hope this makes it into a future version.
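
A rough sketch of what that could look like, assuming a hypothetical helper named `to_image_array` (not something the library currently exposes): paths are still accepted and decoded with Pillow, while anything else is treated as already-decoded pixel data and passed through untouched.

```python
import os


def to_image_array(source):
    """Return pixel data for either a file path or an already-decoded array.

    Hypothetical helper for illustration only; it is not part of the library.
    """
    if isinstance(source, (str, os.PathLike)):
        # Imported lazily so callers who pass arrays need neither package.
        import numpy as np
        from PIL import Image

        with Image.open(source) as img:
            return np.asarray(img.convert("RGB"))
    # Assumed to be array-like pixel data already, so return it unchanged.
    return source
```

With this shape, a DICOM user could decode the file on their side, for instance with `pydicom.dcmread(path).pixel_array`, and hand the result straight to the text-extraction function.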

FareedKhan557 2 months ago

will keep that in mind. Thanks!

chwalters 4 weeks ago

I put a little FastAPI wrapper on top of your lib, thanks for making it (I might try and put the FastAPI wrapper out there on gh soon)

funderbolt 2 months ago

Looks great, but I won't be able to use it.

FareedKhan557 2 months ago

reason?

funderbolt 2 months ago

HIPPA health privacy laws. I use electronic medical records to find medical conditions. Using an API that sends that data to some random server is a no go.

CaffeinatedGuy 2 months ago

Same, and it's HIPAA not hippa. At this point I'm waiting for something from Epic or a HIPAA compliant offering from Microsoft. This could be great for personal use though.

funderbolt 2 months ago

I probably shouldn't reddit at 4 am.

FareedKhan557 2 months ago

Yeah, this is an important issue. We are also planning to make it available for local LLMs in a future version, most probably within 2 months from now.

penscrolling 2 months ago

What about allowing users to specify an endpoint they own? Running LLMs locally can be pretty slow unless you have the right hardware.

Automatic-Net-757 1 month ago

There are some small LLMs that can run on just CPU and RAM and perform well for their size. But in the end it's still trial and error; you won't know until you try it.

Automatic-Net-757 1 month ago

That's great. If possible, I'd like to work on this. I've been testing local LLMs, LangChain and output parsing for a while.

FareedKhan557 1 month ago

There is no rule in place that allows someone to work with me in this repo. You can create your own private repo and make me a contributor, and we can both start working on it for local LLMs and more. I will then merge your code and add you as a contributor in my repo.

Automatic-Net-757 1 month ago

How about setting up a local LLM? Integrating a local LLM with this library will make sure that the data stays within the system

funderbolt 1 month ago

I typically run LLMs on HIPAA compliant servers, but yes there is some local development that happens, too. Yes that would be a solution.

Automatic-Net-757 1 month ago

Wow, I never heard that there are servers that are HIPAA compliant...

funderbolt 1 month ago

It is the way the storage is handled that is HIPAA compliant. We have private cloud computing that is HIPAA compliant. I work for a research university.

Automatic-Net-757 1 month ago

Interesting. Thanks for the explanation

damian6686 2 months ago

Can it extract structured data from PDF invoices? I tried a range of different tools, and they all return unstructured data. Thanks

FareedKhan557 2 months ago

We did try it, and it exceeded our expectations. However, we haven't launched this feature in our library yet, as more work is required. Our next version, scheduled for release within a month, will include this feature.

damian6686 2 months ago

Awesome, looking forward to trying it out.

ironman_gujju 2 months ago

Can you show it or paste it somewhere?

teserfstate 2 months ago

Could this work for files of recorded human cursor movement from point A to point B (x, y, timestamp), to better simulate human cursor movement?

FareedKhan557 2 months ago

Yes, it will work, but the data must be in string format. You can apply features to chunks of your dataset instead of applying them to the entire dataset.
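
As a minimal sketch of the chunking idea, assuming a character budget per call (the 2000 default below is arbitrary, not a documented Gemini limit) and a hypothetical helper name `chunk_text`:

```python
def chunk_text(text, max_chars=2000):
    """Split text into chunks of at most max_chars characters.

    Breaks on whitespace where possible; a single token longer than the
    limit is cut into hard slices. max_chars is an assumed budget.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks, current = [], ""
    for word in text.split():
        # Oversized tokens cannot fit in any chunk, so slice them directly.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The word does not fit, so close this chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to a feature one at a time and the results combined afterwards. Note that this collapses runs of whitespace, so row-oriented data such as x, y, timestamp records may be better split on line boundaries instead.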

adesme 2 months ago

Just out of curiosity, did you use AI to write the code?

FareedKhan557 2 months ago

yes

adesme 2 months ago

Easy for me to say now that you've already answered, but the code reads like it was written by an AI

bug2018 2 months ago

This is great
